1 Retiming of Level-Clocked Circuits


Carl  Ebeling,  Brian  Lockyear 

Using level-sensitive latches instead of edge-triggered registers for storage
elements  in  a  synchronous  system  can  lead  to  faster  and  less  expensive  circuit 
implementations.  This  advantage  derives  from  an  increased  flexibility  in 
scheduling  the  computations  performed  by  the  circuit.  In  level-clocked 
circuits, a value may arrive early and flow through a latch, giving the
following  computation  more  time. 

Taking  full  performance  advantage  of  latches  requires  placing  them  in  the 
circuit  to  achieve  the  best  use  of  the  clock  cycle.  This  process  of  rearranging 
the  storage  elements  in  a  circuit  is  called  retiming  and  can  be  used  to  reduce 
the  cycle  time  or  the  number  of  storage  elements  without  changing  the 
interface  behavior  of  the  circuit  as  viewed  by  an  outside  host.  Retiming  in 
effect reschedules the circuit computations in time based on the length of
those computations. An efficient method for retiming circuits that use
edge-triggered registers has been described by Leiserson, Rose and Saxe [LR83,
LS91].  In  essence,  this  method  uses  the  clock  period  as  the  bound  on  the  delay 
that  can  occur  on  a  path  in  the  circuit  with  no  registers. 
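
The edge-triggered formulation can be illustrated with a minimal Python sketch. It assumes the usual graph model in which each vertex has a propagation delay and each edge carries a register count; the function names and the four-vertex ring are hypothetical and serve only as an illustration, not as the method of [LR83, LS91] in full.

```python
from collections import defaultdict

def clock_period(delay, edges):
    """Longest total delay over paths that contain no registers.

    delay: dict vertex -> propagation delay
    edges: list of (u, v, w), where w is the number of registers on edge u -> v
    Assumes every cycle in the circuit contains at least one register.
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in delay}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)
            indeg[v] += 1
    # Kahn's algorithm over the register-free subgraph.
    ready = [v for v in delay if indeg[v] == 0]
    arrival = {v: delay[v] for v in delay}
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(delay):
        raise ValueError("register-free cycle found")
    return max(arrival.values())

def retime(edges, r):
    """Apply a retiming r: the new weight of u -> v is w + r[v] - r[u]."""
    out = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError("illegal retiming: negative register count")
        out.append((u, v, w_r))
    return out

# Hypothetical ring of four blocks, each with delay 3, and two registers on d -> a.
delay = {"a": 3, "b": 3, "c": 3, "d": 3}
edges = [("a", "b", 0), ("b", "c", 0), ("c", "d", 0), ("d", "a", 2)]
balanced = retime(edges, {"a": 0, "b": 0, "c": 1, "d": 1})
```

In this toy ring the original placement leaves a register-free path through all four blocks, while the retimed placement splits it into two paths of two blocks each, which is the kind of rescheduling the text describes.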

We  have  extended  these  retiming  techniques  to  level-clocked  circuits  by  first 
restricting  the  circuit  domain  to  "well-formed"  level-clocked  circuits.  In 
well-formed  circuits,  latches  occur  in  order  along  any  path  through  the 
circuit.  This  is  the  usual  style  of  multi-phase  circuit  design  and  provides 
retiming  with  maximum  flexibility  for  placing  latches.  We  then  define 
correctness  of  level-clocked  circuits  based  on  the  proper  flow  of  signals 
through the circuit. That is, we require signals departing one latch to arrive
at  the  following  latch  during  the  next  clock  phase  for  that  latch. 

This  definition  of  correctness  leads  to  a  set  of  simple  path  delay  constraints 
and  cycle  delay  constraints.  The  definition  of  a  critical  path  between  any  two 
vertices is extended using these constraints, leading to constraints on the
minimum  number  of  latches  on  each  critical  path  or  cycle.  These  constraints 
can  then  be  solved  for  any  given  clock  period  to  find  a  valid  retiming,  if  any. 
The  usual  search  for  the  optimal  clock  period  can  then  be  performed.  This 
technique  is  valid  for  clocks  with  any  number  of  phases  with  no  constraints 
on  phase  lengths  or  overlap  other  than  the  valid  clock  schedule  constraints. 
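
The solve-then-search loop can be sketched as follows. This is an illustration only: it assumes the constraints for a given period can be written as difference constraints of the form x[a] - x[b] <= c, as in the edge-triggered case, and solves them with Bellman-Ford. The constraint generator `toy_constraints` is entirely hypothetical; the actual level-clocked path and cycle constraints are not reproduced here.

```python
def solve_difference_constraints(variables, constraints):
    """Solve constraints (a, b, c), each meaning x[a] - x[b] <= c.

    Returns a dict of values, or None if the system is infeasible.
    """
    # Each constraint acts as an edge b -> a of weight c. Starting all
    # distances at 0 plays the role of a virtual source connected to every node.
    dist = {v: 0 for v in variables}
    for _ in range(len(variables) + 1):
        changed = False
        for a, b, c in constraints:
            if dist[b] + c < dist[a]:
                dist[a] = dist[b] + c
                changed = True
        if not changed:
            return dist
    return None  # still relaxing after |V| + 1 passes: negative cycle

def min_feasible_period(candidates, constraints_for):
    """Binary search for the smallest candidate period whose system is solvable.

    Assumes feasibility is monotone: if a period works, so does any larger one.
    """
    candidates = sorted(candidates)
    lo, hi, best = 0, len(candidates) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        variables, constraints = constraints_for(candidates[mid])
        solution = solve_difference_constraints(variables, constraints)
        if solution is not None:
            best = (candidates[mid], solution)
            hi = mid - 1
        else:
            lo = mid + 1
    return best

def toy_constraints(period):
    """Hypothetical generator: periods below 6 add a contradictory constraint."""
    constraints = [("a", "b", 1), ("b", "a", 0)]
    if period < 6:
        constraints.append(("a", "b", -1))
    return ["a", "b"], constraints

result = min_feasible_period([3, 6, 9, 12], toy_constraints)
```

The binary search relies on the monotonicity assumption stated in the docstring; if that does not hold for a particular constraint set, a linear scan over the candidate periods would be the safer choice.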

We are now working to relax some of the restrictions we have imposed,
primarily  the  zero  minimum  delay  constraint  and  the  well-formed  circuit 
constraint. 

References: 

[LR83] C. Leiserson, F. Rose, and J. Saxe, "Optimizing Synchronous Circuitry
by Retiming," in Proceedings of the 3rd Caltech Conference on VLSI, March
1983.

[LS91] C. Leiserson and J. Saxe, "Retiming Synchronous Circuitry,"
Algorithmica, Vol. 6, No. 1, pp. 5-35, 1991.


2 Triptych: A New Field-Programmable Gate Array Architecture

Carl Ebeling, Gaetano Borriello, Scott Hauck, David Song, Elizabeth Walkup

Current  general-purpose  FPGAs  use  a  combination  of  programmable  logic 
blocks and programmable interconnect to provide a general circuit
implementation  structure.  This  clear  separation  between  logic  and 
interconnection  resources  is  attractive  because  the  mapping,  placement  and 
routing  decisions  are  decoupled.  The  price  for  this  separation  is  the  large  area 
and delay costs incurred for the flexible interconnection needed to support
arbitrary routing requirements. This leads to architectures like Xilinx, where
the  routing  resources  consume  more  than  90%  of  the  chip  area. 

Domain-specific  FPGAs  like  the  Algotronix  CAL1024  and  Concurrent  Logic 
CFA6000  increase  the  chip  area  devoted  to  logic  by  providing  a  less  general, 
nearest  neighbor,  routing  structure  appropriate  for  structured  applications 
such  as  DSP  and  systolic  algorithms.  These  FPGA  architectures,  however,  are 
not  suitable  for  general  applications,  particularly  state  machines  and 
controllers.

We  have  designed  a  new  FPGA  architecture  called  Triptych  which  can 
efficiently  implement  both  structured  circuits  like  data  paths  and  more 
general  circuits  like  state  machines.  Triptych  differs  from  other  FPGAs  by 
matching  the  structure  of  the  logic  array  to  that  of  the  target  circuits,  rather 
than  providing  an  array  of  logic  cells  embedded  in  a  general  routing 
structure.  By  matching  the  physical  structure  to  the  logical  structure,  we 
reduce  the  amount  of  random  routing  that  is  otherwise  required.  As  shown  in 
Figure 1, Triptych provides an underlying fanin/fanout tree structure that
matches  the  general  structure  of  multi-level  logic  DAGs. 


Figure  1.  The  overall  structure  of  the  Triptych  FPGA  shown  in  a  progression 
of  steps  highlighting  more  and  more  features. 
This basic structure is augmented with segmented routing channels between
the columns that facilitate larger fanout structures than are possible in the
basic array. Finally, two copies of the array, flowing in opposite directions,
are overlaid. Connections between the planes exist at the crossover points of
the short diagonal wires. It is clear that this array does not allow arbitrary
point-to-point routing like that associated with Xilinx and Actel FPGAs.
However, we claim that this array matches the form of a large class of circuits,
and that a mapping strategy that takes this structure into account can produce
efficient implementations.

We have measured the potential of the Triptych architecture relative to other
reprogrammable FPGAs by manually mapping a range of interesting circuits,
including structured circuits and random logic, and comparing the results
with respect to area cost and circuit speed. This comparison shows that
Triptych is similar in cost to the Algotronix CAL1024 for structured circuits
suited to the CAL1024 and superior for circuits with non-local communication.
Triptych is about twice as efficient as Xilinx across the whole range of circuits.
The performance of Triptych implementations is comparable to that of other
FPGAs.

Although Triptych shows considerable potential as an efficient FPGA
architecture, the true viability of Triptych depends on tools for automatically
mapping circuits to its structure. We are now working on a set of mapping,
placement and routing tools that incorporate information about the
underlying routing structure and attempt to satisfy routing constraints early
in the placement process. Our initial goal is to reach the break-even point with
respect to Xilinx, which occurs at about 30% utilization. That is, if we can place
and route circuits utilizing more than 30% of the available Triptych logic,
then our area cost will be less than that of Xilinx. We believe that 50-75%
utilization is ultimately realizable.

The complete text of a paper on Triptych that was presented at the recent Oxford
FPGA workshop appears as an appendix to this report.
3 Subgraph Isomorphism

Carl  Ebeling 


We have been working on a "daughter of Gemini" algorithm [EB83, EB88] for
performing fast subgraph isomorphism for circuit graphs. Such an algorithm
will be useful for automatically identifying components in large VLSI circuits
so that hierarchy can be extracted from layout, and perhaps in technology
mapping for identifying possible coverings of logic graphs by library
components.

Our algorithm is based on graph coloring and operates in two phases. In the
first phase, the subgraph is colored such that the color of each node is
determined only by the internal structure of the subgraph. A node at the
center of the subgraph is also identified as the keystone node. The target
graph is also colored such that the keystone nodes of all subgraph instances
(and possibly other nodes as well) have the same color as the subgraph
keystone node.

In the second phase of the algorithm, each possible keystone node is examined
in turn to verify that it is part of a subgraph instance. This is performed by
coloring from the keystone node in both the subgraph and the target graph.
Fortunately, it is relatively easy to show that nodes outside the subgraph
instance can be kept from causing spurious colors on internal nodes. This
coloring provides a match if one exists, but must rely in difficult cases on
backtracking.
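
The basic relabeling step behind coloring-based matching can be sketched in Python as follows. This is a generic color-refinement illustration on whole graphs, not the algorithm described above: keystone selection, the treatment of nodes on the boundary of a subgraph instance, and backtracking are all omitted. The function name, the shared `palette` argument, and the example graphs are assumptions made for the illustration.

```python
def refine_colors(adjacency, rounds, palette):
    """Recolor each node from its own color and the colors of its neighbors.

    adjacency: dict node -> list of neighbor nodes (undirected graph)
    palette:   dict shared between graphs, so that equal signatures receive
               equal colors no matter which graph they come from
    """
    colors = {v: len(nbrs) for v, nbrs in adjacency.items()}  # start from degree
    for step in range(rounds):
        new_colors = {}
        for v, nbrs in adjacency.items():
            signature = (step, colors[v], tuple(sorted(colors[u] for u in nbrs)))
            new_colors[v] = palette.setdefault(signature, len(palette))
        colors = new_colors
    return colors

# Two chains that differ only in node names, and a star of the same size.
chain1 = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
chain2 = {"p": ["q"], "q": ["p", "r"], "r": ["q", "s"], "s": ["r"]}
star = {"h": ["x", "y", "z"], "x": ["h"], "y": ["h"], "z": ["h"]}

palette = {}
c1 = refine_colors(chain1, 2, palette)
c2 = refine_colors(chain2, 2, palette)
c3 = refine_colors(star, 2, palette)
```

Nodes that are structurally equivalent (the two chain endpoints, for example) end up with the same color, and graphs with different structure produce different color multisets. Equal color multisets are a necessary condition for a match but not a proof of one, which is why a verification phase is still needed.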

A prototype implementation of this algorithm has been completed and
preliminary results indicate that the algorithm works well for practical
circuits. In the future we will be modifying some of the data structures to
optimize the performance of the program, and measuring the performance for
large graphs and difficult subcircuits.

References:

[EB83] C. Ebeling and O. Zajicek, "Validating VLSI Circuit Layout by Wirelist
Comparison," in Proceedings of ICCAD, pp. 172-173, 1983.

[EB88] C. Ebeling, "Gemini II: A Second Generation Layout Validation Program,"
in Proceedings of ICCAD, pp. 322-325, Nov. 1988.
4 Symbolic Timing Verification and High-Level Synthesis


Gaetano Borriello, Tod Amon

Symbolic timing verification is a powerful extension to traditional constraint
checking that allows delays and constraints to be expressed as symbolic
variables. The verifier then determines the relationships between these
parameters based on known propagation delays and the timing constraints to
be satisfied. This type of verification provides a way for synthesis tools to
derive delay constraints on internal functions, given interface, throughput,
and latency constraints provided by the user. We have developed an approach
to symbolic timing verification using constraint logic programming
techniques. The techniques are quite powerful in that they yield not only
simple bounds on delays but also relate the delays in linear inequalities so that
tradeoffs are apparent. We model circuits as communicating processes and our
current implementation can verify a large class of mixed synchronous and
asynchronous specifications.

Symbolic timing verification can be used to answer many other questions
about a design specification and implementation. If circuit delays are known,
then the verifier can still provide an answer as to whether or not a particular
implementation meets all the timing constraints in the specification (as in
current non-symbolic verifiers). More generally, however, using a variable
for a delay can provide an answer about the range of values assignable to that
variable that will still meet the constraints. This can be valuable information
for a synthesis tool that must decide how to implement that particular
function. At low levels of granularity for circuit functions it may lead to a
different logic network being used for a combinational function. At a high
level it may lead to a different architecture to implement a computation (e.g.,
more or less parallel). When many delays are represented by variables, the
answer may be a set of linear inequalities constraining the variables. These
relations provide synthesis tools with information about tradeoffs between
circuit delays and how implementation choices for circuit functions may
affect each other.
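
A small Python sketch can make the difference between a single bound and a linear inequality concrete. It assumes a deliberately simplified setting in which every constraint says that the sum of the delays along one path must not exceed a bound; the function name and the delay names are hypothetical, and the constraint logic programming machinery of the actual verifier is not modeled.

```python
def residual_inequalities(constraints, known):
    """Substitute known delays into constraints of the form sum(delays) <= bound.

    constraints: list of (delay_names, bound)
    known:       dict delay_name -> numeric value
    Returns a list of (unknown_names, remaining_bound), meaning that the
    unknown delays together must not exceed remaining_bound.
    """
    out = []
    for names, bound in constraints:
        unknown = tuple(n for n in names if n not in known)
        remaining = bound - sum(known[n] for n in names if n in known)
        out.append((unknown, remaining))
    return out

# Hypothetical paths: decode -> alu within 20, decode -> mem -> mux within 30.
paths = [(("decode", "alu"), 20), (("decode", "mem", "mux"), 30)]

one_free = residual_inequalities(paths, {"decode": 5, "mux": 2})
two_free = residual_inequalities(paths, {"decode": 5})
```

With `mux` known, each constraint reduces to a simple upper bound on one delay. With `mux` left symbolic, the second constraint becomes an inequality over `mem` and `mux` together, which exposes the tradeoff between the two implementation choices.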

Therefore, a symbolic timing verifier is extremely valuable for synthesis. The
information it produces about circuit delays can be used to determine how
much time is available for a sequence of operations so that any available slack
can be exploited in minimizing the circuitry required. As user-specified
timing constraints change, the synthesis process may lead to very different
circuit implementations that either exploit relaxed constraints to minimize
area or use higher-performance architectures and components to meet tighter
requirements. The utility of symbolic timing verification can be further
extended if we consider symbolic timing constraints. In this case, the
verification tool serves as an analysis tool that can determine how circuit
delays relate to the symbolic constraints. For example, if we wish to determine
the maximum throughput of a circuit, a variable can be used on the
throughput constraint and the verifier can determine its range of values
given circuit delays and other constraints.

It is these uses of symbolic timing verification, determining delay flexibility
in synthesis and how it affects the resulting circuit architecture, that
motivate this work. The information obtainable from a symbolic verification
process has three principal uses: (1) implementation verification, (2)
obtaining constraints for synthesis, and (3) design evaluation.
Implementation verification confirms that an implementation of a design and
its associated delays will meet the constraints in the original specification.
Synthesis tools can use symbolic delay values to determine the degree of
flexibility that is available while still satisfying the timing constraints. This
can lead to much more efficient use of resources in the final implementation.
Design evaluation can be performed by using symbolic values in the
constraints and determining bounds on the values of these variables given
circuit delays. This can provide information about how well a design will
perform and also relate the constraint variable to circuit delays.
5 Synthesis of Microcontroller-Based Embedded Systems


Gaetano Borriello, Pai Chou, Ross Ortega

Most digital design involves the design of controllers, and most of the
controllers currently on the market involve an embedded
microcontroller that orchestrates the behavior of the system. However,
high-level synthesis tools have yet to target this type of low-cost, yet
ubiquitous, implementation medium and still focus on custom integrated circuit
implementations. We have recently begun a project to address this required
change in emphasis.

The issue is even more urgent as integration levels are increasing to the point
where designers of these control systems have little idea what to do with all
the devices that are now available on a single die or within a single package.
In response to this, highly programmable logic arrays are being marketed
that can take advantage of economies of scale while still providing hardware
speeds and effectively the same density at the board level as custom logic.
Field-programmable gate arrays are not the only instance of this.
Microcontrollers, with ever increasing options for their communication ports,
are becoming extremely commonplace. There are even cases of the two being
integrated onto the same chip, thus allowing an entire control system to be
implemented in a single device that includes the required program and data
memory.

Our project is to study all aspects of synthesizing to these types of devices.
Problems range from how control-dominated designs are specified, to how
tradeoffs can be made as to which functions will go in hardware and which in
microcontroller software. Our approach is based on transformation of
communication channels between the parallel processes that make up the
system. We start from an initial specification and determine where a cut can
be made between processes, separating them on the basis of hardware or
software implementation. The position of the cut is based on the
communication channels crossing it and whether they can be effectively and
efficiently mapped to the microcontroller's I/O ports. Of course,
transformations will be required to change the communication so that an
optimized cut can be made. Criteria for guiding the transformation include
width and speed of the communication channel as well as the size of the
processes. Bandwidth and speed requirements have to be met. Program or
hardware size may be under constraints related to the details of the
microcontroller and programmable logic being used. Tradeoffs between the
two media are necessary to get designs to fit.
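
The idea of choosing a cut under size and port constraints can be illustrated with a small exhaustive search in Python. This is only a sketch under invented assumptions: the cost model (minimize hardware size, subject to a software size limit and a limit on the total width of channels crossing the cut), the function name, and all numbers are hypothetical, and the channel transformations that are the focus of the project are not modeled.

```python
from itertools import product

def best_cut(sizes, channels, max_sw_size, max_port_bits):
    """Try every hardware/software placement and keep the cheapest feasible one.

    sizes:    dict process -> (software_size, hardware_size), arbitrary units
    channels: list of (process_a, process_b, width_bits)
    Returns (hw_size, port_bits, placement) or None if nothing is feasible.
    """
    procs = sorted(sizes)
    best = None
    for choice in product(("sw", "hw"), repeat=len(procs)):
        place = dict(zip(procs, choice))
        sw_size = sum(sizes[p][0] for p in procs if place[p] == "sw")
        hw_size = sum(sizes[p][1] for p in procs if place[p] == "hw")
        # Only channels whose endpoints land on different sides use port pins.
        port_bits = sum(w for a, b, w in channels if place[a] != place[b])
        if sw_size > max_sw_size or port_bits > max_port_bits:
            continue
        candidate = (hw_size, port_bits, place)
        if best is None or candidate[:2] < best[:2]:
            best = candidate
    return best

# Hypothetical system of three processes and three channels.
sizes = {"ui": (400, 50), "filter": (900, 120), "uart": (300, 40)}
channels = [("ui", "filter", 8), ("filter", "uart", 8), ("ui", "uart", 1)]

wide_ports = best_cut(sizes, channels, max_sw_size=800, max_port_bits=16)
narrow_ports = best_cut(sizes, channels, max_sw_size=800, max_port_bits=12)
```

Tightening the port limit from 16 to 12 bits moves the cut in this toy example, which shows how channel width can drive the partition. The search visits every one of the 2^n placements, so it is practical only for a handful of processes.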

Currently, we are in the initial phases of implementing our system. We are
using a subset of Verilog with some specialized macros as our front-end and
plan to generate Verilog output as well. The output processes will be tagged as
being hardware or software and then sent to the appropriate low-level
synthesis tools for final mapping. Most of our efforts are going into the
development of a library of transformations for communication channels.
Each transformation module will create new processes of either hardware or
software to replace an existing communication channel and improve one of
the above metrics. Also, we are investigating parallelization and
sequentialization algorithms for communicating sequential processes, as these
will be needed to meet constraints or map multiple processes to a single
microcontroller.
6 Chaos Router


Kevin Bolding, Samson Cheung, Carl Ebeling, Larry Snyder

The Chaos Router is a randomizing, nonminimal adaptive packet router for
use in communication systems of parallel computers [KS90, KS91]. The router
is deterministically deadlock free, probabilistically livelock free, and well
suited for k-ary d-cube topologies. The present research thrust is to compare
chaotic routing with other routing strategies, e.g., oblivious routing and
deflection routing, to understand the circumstances under which it is superior
to competitors and to estimate quantitatively the amount of the improvement.
Additionally, the fault tolerance implications of the router are being
investigated. [This research is funded by NSF Grant MIP-9013274 and ONR
Grant N00014-91-J-1007.]

Since  the  last  semiannual  report,  considerable  progress  has  been  achieved  in 
the  areas  of  performance  analysis  and  fault  tolerance. 

To evaluate the performance of chaotic routing against known routing
methods, an enormous number of simulations have been performed for a
variety of topologies (hypercube, mesh and torus), a variety of load
characteristics (uniform random and 4X hot spots) and for a number of system
sizes (64, 256 and 1024). The voluminous data generated by these simulations
cannot be easily distilled into a single characterization. However, the
accompanying graph gives a typical summary. Shown below are results for
an oblivious router, a deflection (hot potato) router, and a chaotic router.


[Figure: 256-node Torus Latency (Uniform Traffic); curves labeled chaos and oblivious.]

The topology is a torus of 256 nodes. The plots show normalized throughput
(bisection  bandwidth  =  100%)  and  latency  (in  flit  times)  as  a  function  of 
presented  load.  The  results  show  that  chaotic  routing  with  shared  channels  is 
able  to  transmit  nearly  all  of  the  presented  load  and  to  have  the  minimum 
latency  among  the  different  routers  throughout  most  of  the  range.  Further 
results for 2-dimensional structures are presented in [BS91a].

In the fault tolerance arena, a protocol has been developed to implement
system  level  fault  tolerance.  The  protocol  includes  detection  of  lost  or  blocked 
packets,  fault  diagnosis  (including  discovery  of  inoperative  channels  and 
processors),  fault  recovery  (including  a  limited  amount  of  system 
reconfiguration),  and  system  restart.  Detection  is  especially  interesting  in  the 
chaotic  router  because  there  is  no  "worst  case  time  for  delivery"  of  a  packet, 
and  thus  it  is  not  possible  to  use  the  usual  timeout  of  acknowledgment  failure 
to identify lost packets. Details of the protocol can be found in [BS91b].
Defect and Fault Tolerance in VLSI Systems, November 1991.
7 The MacTester


Carl Ebeling, Neil MacKenzie, Larry McMurchie

The MacTester is a low-cost functional tester for both VLSI chips and
board-level designs. The test head datapath uses Xilinx FPGAs and provides 128
test signals, each of which can be dynamically assigned as input or output. The
test head is a ZIF socket that accommodates a variety of DIPs and PGAs up to
132 pins without any auxiliary wiring, with the possible exception of power
and ground for large chips. The MacTester was originally designed with an
interface to the Mac NuBus. During the summer of '91, we completed the
design of an interface to the IBM AT bus.

There are two software environments for running the MacTester. One consists
of writing a test program (using any ANSI C compiler) and linking it with a
library of low-level tester routines that set and observe pins of the DUT.
The other environment is the DesignWorks schematic capture and simulation
system (from Capilano Computing), where the DUT is represented by an icon
just like any other schematic device. In this way one can add circuitry to the
schematic drawing that sets values on the DUT input pins and
observes/checks the output pins.

Recently, Applied Precision Inc. of Mercer Island, WA has indicated they will
manufacture and sell a tester based on the MacTester design. Availability is
scheduled for early 1992.

Apple Computer provided funds and equipment for the design of the
MacTester. Funding for the MacTester software development was provided by
a Software Capitalization Grant (NSF Grant #MIP-9018224). Through MOSIS,
DARPA has provided a means of fabricating tester PCBs.


TRIPTYCH: A New FPGA Architecture

Carl Ebeling, Gaetano Borriello,
Scott A. Hauck, David Song,
Elizabeth A. Walkup

Department  of  Computer  Science  &  Engineering 
University  of  Washington 
Seattle, WA 98195

Technical  Report  91-09-05 
September 1991


This paper appears in "FPGAs", a book that contains
the proceedings of the Oxford Workshop on Field
Programmable Logic, September 1991.



Abstract 

Existing  FPGA  architectures  can  be  classified  along  two  dimensions: 
reprogrammable vs. one-time programmable and general-purpose vs.
domain-specific. The most challenging class of FPGA architectures to design is the
reprogrammable,  general-purpose  FPGA,  of  which  Xilinx  is  the  most  well- 
known  example.  In  this  paper  we  describe  Triptych,  a  new  FPGA  architecture 
that  addresses  two  problems  of  current  reprogrammable  FPGAs:  the  large 
delays  incurred  in  composing  large  functions  and  the  strict  division  between 
routing  and  logic  resources.  Our  studies  indicate  that  Triptych  is  more  area- 
efficient  than  current  architectures  and  has  comparable  delay  characteristics  for  a 
large  range  of  circuits  that  include  both  data-path  elements  and  control  logic. 


INTRODUCTION 

The most common approach to field-programmable gate array architectures is to dedicate a
portion  of  the  total  chip  area  to  logic  functions  and  the  remainder  to  interconnection  resources. 
The  logic  functions  may  be  fixed  or  programmable,  while  the  routing  is  usually  highly 
programmable  to  ensure  that  a  large  percentage  of  designs  are  routable.  The  flexibility  of  the 
interconnection  network  is  limited  by  two  factors:  the  number  of  configuration  points  (bits  or 
fuses)  that  can  be  accommodated  on  chip  and  the  speed  requirements  of  the  signals  routed 
through the network (more switches or fuses on a signal path imply slower wires) (Rose
1991). 

FPGAs  can  be  programmed  using  a  reprogrammable  memory-based  scheme  or  a  one-time 
programmable  fuse  technology.  Xilinx  is  the  most  well-known  example  of  a  reprogrammable 
FPGA  (Carter  1986).  It  has  logic  blocks  that  can  perform  arbitrary  functions  of  five  inputs. 
The  routing  resources  are  arranged  in  an  orthogonal  grid  around  the  function  blocks  and 
occupy  approximately  90%  of  the  chip  area.  Approximately  300  function  blocks  can  be  placed 
in  a  single  device,  the  number  being  limited  by  the  extra  routing  resources  additional  function 
cells would require. In a chip with 320 cells, 64,160 programming bits are required, or
approximately  200  bits  per  cell  (Xilinx  1991). 

Among one-time programmable FPGAs, Actel is the most common (El-Ayat 1989). Actel
arranges  a  basic  cell  in  rows  similar  to  an  arrangement  of  standard  cells  in  a  semi-custom 
integrated  circuit.  The  cell  functionality  is  fixed,  with  the  logic  function  determined  by  where 
inputs  are  connected  to  the  cell  (typical  usage  is  as  a  3-input  function).  The  interconnection 
resources  are  also  similar  to  the  standard  cell  style  with  wires  running  in  segmented  channels 
between  the  rows  of  cells  and  orthogonally  across  the  cells  to  provide  routing  in  all  four 
directions.  The  logic  cells  account  for  10-15%  of  the  chip  area  and  750,000  bits  are  required  to 
program a typical chip of 1200 cells (Actel 1991). The number of routing tracks limits the total
number  of  cells  that  can  be  placed  with  reasonable  routability  on  a  single  chip. 

A more recent entry in the FPGA arena is the Apple Labyrinth architecture (Furtek 1990).
Rather  than  dedicating  chip  area  to  either  computation  or  interconnect,  the  Labyrinth  FPGA 
tiles  the  chip  with  identical  small  cells  that  can  perform  either  2-input  functions  or  routing, 
depending  on  the  user-specified  programming  in  the  4  bits  per  cell.  Each  cell  is  connected  only 
to its four nearest neighbors. The design is intended for pipelined bit-serial applications,
because  the  delays  incurred  in  routing  through  many  cells  severely  limit  the  cycle  time. 
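
The configuration cost per cell quoted for the three architectures above can be put side by side with a few lines of Python. The figures are taken directly from the text; the comparison is only indicative, since a cell differs considerably in capability from one architecture to the next.

```python
# (programming bits, cells) as quoted in the text above
devices = {
    "Xilinx": (64_160, 320),
    "Actel": (750_000, 1_200),
}

bits_per_cell = {name: bits / cells for name, (bits, cells) in devices.items()}
bits_per_cell["Labyrinth"] = 4.0  # stated directly as 4 bits per cell

for name, value in bits_per_cell.items():
    print(f"{name}: {value} programming bits per cell")
```

The Xilinx figure works out to 200.5, consistent with the "approximately 200 bits per cell" stated above, and the Actel figure to 625.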

In  this  paper,  we  present  an  alternative  structure  for  reprogrammable  FPGAs  that  blends 
logic  and  routing  resources  more  closely  than  most  other  FPGAs.  That  is,  each  routing  and 
logic block (RLB) in the Triptych array can be used both to compute a logic function and route
signals. More importantly, the array is structured to match the inherent fanin/fanout tree
structure  of  circuit  graphs.  This  allows  the  physical  layout  of  a  mapped  circuit  to  follow  its 
logical  structure,  reducing  the  need  for  extensive  routing  resources.  Circuits  use  varying 
numbers  of  RLBs  for  routing  depending  on  how  much  their  structure  diverges  from  the 
Triptych  structure. 

We decided to undertake the detailed design and implementation of Triptych in the graduate
VLSI implementation course (CSE568) in the winter quarter of 1991. The problem was an
ideal  class  project  because  there  was  only  a  small  collection  of  basic  cells  to  design,  and 
students  could  work  on  implementation  and  mapping  issues  in  parallel.  This  paper  describes 
the  basic  Triptych  architecture  and  the  experience  we  gained  implementing  it  and  mapping 
circuits  to  it.  The  two  sections  following  this  introduction  describe  the  architecture  in  detail  and 
the  issues  and  design  choices  encountered  during  implementation.  The  next  section  provides  a 
first  look  at  how  the  architecture  can  be  used  and  how  it  compares  to  others,  as  well  as  some 
ideas  for  automatic  mapping.  Finally,  we  conclude  with  remarks  about  both  the  architecture 
and  the  educational  experience. 


TRIPTYCH 

The  FPGA  architecture  we  present  in  this  paper  differs  from  other  FPGAs  by  matching  the 
structure  of  the  logic  array  to  that  of  the  target  circuits,  rather  than  providing  an  array  of  logic 
cells  embedded  in  a  general  routing  structure.  By  matching  the  physical  structure  to  the  logical 
structure,  we  reduce  the  amount  of  “random”  routing  that  is  otherwise  required. 

Figure  1  shows  a  high-level  view  of  a  typical  multi-level  combinational  logic  circuit.  The 
flow  is  shown  as  unidirectional,  from  inputs  to  outputs.  From  the  point  of  view  of  each  input, 
the  data  flow  forms  a  fanout  tree  (shown  with  solid  arrows)  to  those  outputs  that  the  input 
affects. From the point of view of each output, the data flow forms a fanin tree (shown with
dashed arrows) from those inputs it depends upon. It is this fanin/fanout tree form that
Triptych emulates architecturally by arranging RLBs into columns, with each RLB having a
short, hard-wired connection to its nearest neighbors in adjacent columns (see Figure 2).

The  basic  structure  is  augmented  with  segmented  routing  channels  between  the  columns 
that facilitate larger fanout structures than are possible in the basic array. Finally, two copies of
the  array,  flowing  in  opposite  directions,  are  overlaid.  Connections  between  the  planes  exist  at 
the  crossover  points  of  the  short  diagonal  wires.  It  is  clear  that  this  array  does  not  allow 
arbitrary  point-to-point  routing  like  that  associated  with  Xilinx  and  Actel  FPGAs.  However, 
we  claim  that  this  array  matches  the  form  of  a  large  class  of  circuits  and  that  mapping  will 
produce  routable  implementations. 


Figure  1  View  of  a  multi-level  combinational  logic  circuit  as  interleaved  fanin/fanout 
trees. 


Figure  2  The  overall  structure  of  the  Triptych  FPGA  shown  in  a  progression  of  steps 
highlighting  more  and  more  features.  The  basic  fanin/fanout  structure  on  the  left  is 
augmented  with  segmented  routing  channels  that  make  a  third  input  and  a  third  output 
available  to  the  RLBs.  The  structure  on  the  right  is  obtained  by  merging  two  copies  of 
the  middle  structure,  with  data  flowing  in  opposite  directions  in  the  two  copies.  Not 
shown are the connections between the two copies, which permit internal feedback.

Each RLB in the array has three inputs and three outputs and may perform an arbitrary
logic function of the three inputs, with the result optionally held by a master/slave D-latch
(Rose 1990). Routing in the Triptych array takes three forms: horizontally through the RLBs
(by selecting an input to be routed to an output), diagonally through short wires to neighbors,
and vertically through the segmented channels between columns of RLBs. Only one input and
one output can be connected to the vertical wires; the other two must be on the local diagonal
interconnect.

Circuits can be mapped onto this array by partitioning the logic into circuit DAGs
containing nodes with at most three inputs. These DAGs are then mapped to the physical
structure,  with  the  inputs  at  one  side  of  this  structure  and  the  outputs  generated  at  the  other. 
The  nodes  of  the  DAGs  are  placed  such  that  input  signals  are  available  from  the  neighbor  nodes 
or  along  a  vertical  connection.  As  Rose  suggests  in  (Singh  et  al  1990),  delay  can  be  minimized 
by  using  mostly  direct,  hard-wired  connections  for  the  critical  path.  Triptych  implementations 
do  not  strive  for  100%  logic  utilization.  Many  RLBs  will  be  used  to  provide  routing,  either  to 
fanout  a  signal  or  to  pass  it  forward  to  the  next  level.  Sometimes  a  mapping  will  leave  some 
cells  unused  to  achieve  a  routable  placement  of  nodes.  Examples  are  provided  below. 
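
The partitioning constraint described above can be checked mechanically. The following is a minimal sketch, not the authors' mapping tool: the function name `check_and_levelize`, the dictionary representation of the DAG, and the use of logic depth as a first guess at a column index are all assumptions made for illustration.

```python
MAX_FANIN = 3  # each RLB function block accepts at most three inputs


def check_and_levelize(fanin):
    """Validate a circuit DAG and return the logic depth of every node.

    `fanin` maps each function node to the list of signals it reads.
    Signals that never appear as keys are treated as primary inputs.
    """
    for node, sources in fanin.items():
        if len(sources) > MAX_FANIN:
            raise ValueError(f"{node} reads {len(sources)} signals; limit is {MAX_FANIN}")

    levels = {}

    def level_of(signal, visiting):
        if signal not in fanin:          # primary input
            return 0
        if signal in levels:             # already computed
            return levels[signal]
        if signal in visiting:           # a DAG must not contain cycles
            raise ValueError(f"cycle through {signal}")
        visiting.add(signal)
        levels[signal] = 1 + max((level_of(s, visiting) for s in fanin[signal]), default=0)
        visiting.discard(signal)
        return levels[signal]

    for node in fanin:
        level_of(node, set())
    return levels


# A two-level example: g depends on f, so it sits one level later.
print(check_and_levelize({"f": ["a", "b", "c"], "g": ["f", "d"]}))
```

The depth of a node is only a lower bound on its column, since RLBs used purely for routing may push a node further along the array.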


Figure  3.  Triptych  RLB  design.  The  RLB  consists  of:  3  multiplexers  for  the  inputs, 
a 3-input function block, a master/slave D-latch, a selector for the latched or unlatched
result  of  the  function,  and  3  multiplexers  for  the  outputs. 


RLB  structure 

A logical schematic of the basic Triptych RLB is shown in Figure 3. As can be seen, the cell is
designed to handle both function calculation and signal routing simultaneously (hence the name
routing and logic block, RLB). It takes input from three sources and feeds them into a function
block capable of computing any function of the three inputs, and the output can then be used in
latched or unlatched form. The RLB's three outputs can choose from any of the three inputs
and either the latched or unlatched version of the function block output. One last feature is the
loopback from the master/slave D-latch, which enables the function to be dependent on its
previous value. This last feature is included for state machine implementation, although it may
also be used to output both the latched and unlatched versions of the function block. Again, only
one of the inputs and one of the outputs can be connected to the vertical wires; the other two of
each type are connected to the local diagonal wires.
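
A behavioral sketch may make the RLB description concrete. It is an illustration under stated assumptions, not the actual circuit: the class name `RLB`, its field names, and the bit ordering of the truth table are hypothetical, and the input multiplexers and the D-latch loopback are omitted. The only facts taken from the text are that the function block computes any function of three inputs (which requires 2^3 = 8 truth-table bits) and that each output selects among the three inputs and the latched or unlatched function result.

```python
from dataclasses import dataclass


@dataclass
class RLB:
    truth_table: int   # 8 bits; bit i holds f(a, b, c) for i = a | b<<1 | c<<2 (assumed order)
    use_latch: bool    # True selects the latched result, False the unlatched one
    out_select: tuple  # per output: 0, 1, 2 pass that input through; 3 selects the function result
    latch: int = 0     # current contents of the D-latch

    def evaluate(self, a, b, c):
        """Return (outputs, f): the three output values and the new function value."""
        index = a | (b << 1) | (c << 2)
        f = (self.truth_table >> index) & 1
        result = self.latch if self.use_latch else f
        choices = (a, b, c, result)
        outputs = tuple(choices[s] for s in self.out_select)
        return outputs, f


# Three-input XOR (truth table 0x96) on the first output; inputs a and b routed through.
rlb = RLB(truth_table=0x96, use_latch=False, out_select=(3, 0, 1))
print(rlb.evaluate(1, 0, 1))
```

A caller modeling clocked operation would store the returned `f` into `latch` at each clock edge.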

Typical  RLB  utilization 

A  Triptych  RLB  is  capable  of  performing  both  function  calculation  and  routing  tasks 
simultaneously,  which  leads  to  several  different  uses  of  the  RLB  (see  Figure  4).  The  three 
most obvious are: (a) a routing block with each input connected to one of the outputs; (b) a
splitter with one of the inputs going to two or three of the outputs; and (c) a function
calculator with the three inputs going to the function block and the function going out the
outputs. However, there are two important classes of hybrids that help produce more compact
designs. The first comes from the observation that in blocks used to calculate a three-input
function, the function block will most likely not go out all three outputs, and one or two of the
input signals could be sent out the unused output connection(s), as in (d). Secondly, a
function of two inputs can be implemented by making the function insensitive to the third
input, thus allowing the unused input to be used to route an arbitrary signal, as in (e). An
important observation is that the RLBs will never need to be used for one-input functions (i.e.,
an inverter), since any output signal will only be used either as an input to another arbitrary
function block, where the inverter could just be merged into the function computed, or to an
output pin, where an optional inversion can be applied.
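
The two observations about inverters and two-input functions can be illustrated on 8-bit truth tables. This is a sketch only: the helper names and the bit ordering (bit i holds f for i = a | b<<1 | c<<2) are assumptions, not part of the Triptych design.

```python
def absorb_input_inverter(truth_table, which):
    """Truth table of g, where g is f with input number `which` (0, 1 or 2) inverted."""
    out = 0
    for index in range(8):
        flipped = index ^ (1 << which)               # same inputs, one of them complemented
        out |= ((truth_table >> flipped) & 1) << index
    return out


def absorb_output_inverter(truth_table):
    """Truth table of NOT f."""
    return ~truth_table & 0xFF


def is_insensitive(truth_table, which):
    """True if f ignores input `which`, so that input pin is free to route another signal."""
    return all(
        ((truth_table >> i) & 1) == ((truth_table >> (i ^ (1 << which))) & 1)
        for i in range(8)
    )


AND3 = 0x80   # true only when a = b = c = 1
AND_AB = 0x88  # a AND b, regardless of c
print(hex(absorb_input_inverter(AND3, 0)), is_insensitive(AND_AB, 2))
```

Because an upstream inverter only permutes or complements table entries, it never costs an extra RLB.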

As  was  shown  earlier,  the  Triptych  FPGA  has  no  generalized  interconnect  for  moving 
signals  horizontally.  Instead,  there  is  a  heavy  reliance  on  unused  RLBs  and  unused  portions  of 
RLBs  to  perform  these  routing  tasks. 


Figure  4  Five  typical  uses  of  Triptych  RLBs. 

Interconnection 

The Triptych RLBs are connected by three separate interconnection schemes. The first is for
horizontal  interconnect  and  is  accomplished  through  the  RLBs  as  described  above.  The  second 
is  for  local  high-speed  communication  between  neighboring  RLBs  and  is  achieved  through 
“diagonals”.  The  detailed  structure  of  the  diagonals  is  shown  in  Figure  5.  They  allow  RLBs 
to  send  outputs  to  the  RLBs  immediately  above  and  below  them,  which  flow  in  the  opposite 
direction, and to the two RLBs in the same position in the next column, which flow in the same
direction. Diagonals are important for two reasons. Diagonals permit the construction of
multilevel functions of more than three inputs without the speed penalty of general-purpose
interconnect. They also allow signal flow to change direction, both so that circuits can be more
tightly packed and so that feedback can be provided for the implementation of sequential logic.


Figure  5  Schematic  view  of  a  pair  of  diagonals  and  the  routing  combinations  they 
allow  (implemented  by  a  multiplexer  at  each  diagonal  input).  The  diagonals  connect  an 
RLB’s  outputs  to  the  RLB’s  four  nearest  neighbors:  two  directly  above  and  below  in 
the  same  column  and  the  two  in  the  same  positions  in  the  next  column. 


The  third  type  of  interconnect  is  used  for  longer  range  connections  and  large  fanout  nodes. 
It  is  implemented  as  a  set  of  segmented  “channel  wires”  between  adjacent  columns  (see  Figure 
6)  that  connect  middle  outputs  of  RLBs  to  the  middle  inputs  of  RLBs  flowing  in  the  same 
direction in the next column. Needless to say, this flexibility leads to a slower path, and speed-
critical designs will avoid using the vertical channels for critical paths. There are 7 tracks in a
vertical channel, with 6 handling inter-cell RLB routing and a seventh to carry a pin input. The
6 inter-cell tracks are broken up into two tracks each of 8, 16, and 32 RLB high segments.




Figure 6 Top half of a segmented channel (on its side). The bottom half is a mirror
image of the top.

One last important feature of the interconnect structure is how it handles the array borders.
Since there are no RLBs beyond the right and left edges for the channel wires to route to, the
channels on the edges tie the two directions of RLBs together. This way of handling the
border cases leads to a different way of looking at the array, namely as a cylinder of RLBs. If
the diagonals leading to the opposite direction of RLBs were cut except for those at the edges,
the chip would appear to be a folded cylinder of RLBs. In fact, it is often helpful to think of the
array as containing many smaller cylinders. For example, a six by six square of RLBs can be
broken off from the rest of the array and considered to be a cylinder three RLBs high and
twelve RLBs in circumference. This is not quite true because the vertical channel for the left
and right edges of the original six by six square will be unusable on the cylinder; however, it
can be a useful abstraction for hand mapping. In fact, the Triptych chip is an array of 64x8
RLBs, yielding a cylinder of 32x16.
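
The folding arithmetic above can be written as a one-line helper. This is a sketch under an assumption: `cylinder_view` is a hypothetical name, and the first dimension is taken to be the one along which the flow direction alternates, so that it halves while the other dimension doubles.

```python
def cylinder_view(alternating, along_flow):
    """Dimensions (height, circumference) of the cylinder formed by a block of RLBs.

    Half of the RLBs in the alternating dimension flow each way, so the height is
    halved; joining the two directions at the edges doubles the circumference.
    """
    if alternating % 2:
        raise ValueError("need an even count so both directions are equally represented")
    return alternating // 2, 2 * along_flow


print(cylinder_view(6, 6))   # the six by six example from the text
print(cylinder_view(64, 8))  # the full chip
```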

Programming  bit  implementation  and  the  scan  path 

Triptych is a RAM-based reprogrammable gate array with 26 memory bits per RLB, including
those bits used for all three types of routing. The memory cells are implemented pseudo-
statically with a “hold” signal asserted during normal operation and deasserted during
programming.  We  found  that  this  gave  a  much  smaller  layout  than  a  fully  static  design 
(including  the  space  needed  for  this  extra  hold  line),  especially  when  it  was  realized  that  the 
hold  signal  was  necessary  for  selectively  disabling  RLB  output  drivers  during  programming. 
The  memory  cells  are  connected  by  a  scan  path  running  throughout  the  chip,  allowing  it  to  be 
programmed  by  cycling  data  through  the  bits. 

The scan path used for programming is also attached to the RLB’s master/slave D-latches.
This  not  only  allows  the  chip  to  start  in  any  arbitrary  combination  of  latch  states,  but  it  also 
allows  the  contents  of  the  latches  to  be  shifted  out  after  the  chip  has  run  an  arbitrary  number  of 
cycles  to  facilitate  debugging.  Also,  if  the  scan  path  input  is  connected  to  the  output,  a 
programmed  circuit  can  be  stopped  at  any  point,  the  contents  of  the  D-latches  analyzed,  and  the 
circuit  resumed  at  the  previous  starting  point. 

Vital  statistics 

The  speed  of  a  path  in  a  Triptych  RLB  can  be  calculated  from  the  numbers  given  below  in 
Table 1. For example, a path using 4 RLBs, 2 for routing and 2 for function calculation, and 1
channel wire would take 13.9±0.6 nanoseconds (4×1.6 + 2×2.2 + 3.1±0.6 = 13.9±0.6).
Note  that  being  able  to  use  such  a  simple  speed  calculation  method  is  due  both  to  the  simplicity 
of  the  interconnect  and  also  to  the  design  philosophy  of  “independence  of  paths”  described 
below. 
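
The additive delay model can be expressed in a few lines. The constants are the Table 1 figures; the function name `path_delay_ns` and its argument names are illustrative assumptions.

```python
RLB_NS = 1.6              # delay through any RLB on the path
FUNCTION_EXTRA_NS = 2.2   # additional delay when the RLB's function block is used
CHANNEL_NS = (2.5, 3.7)   # fastest and slowest channel wire


def path_delay_ns(rlbs, function_blocks, channel_wires):
    """Return the (minimum, maximum) delay of a path in nanoseconds."""
    base = rlbs * RLB_NS + function_blocks * FUNCTION_EXTRA_NS
    return (base + channel_wires * CHANNEL_NS[0],
            base + channel_wires * CHANNEL_NS[1])


# The example from the text: 4 RLBs, 2 of them computing functions, 1 channel wire.
low, high = path_delay_ns(4, 2, 1)
print(f"{(low + high) / 2:.1f} +/- {(high - low) / 2:.1f} ns")
```

The midpoint and half-width of the returned range reproduce the 13.9±0.6 ns figure quoted above.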


Table 1 Speed of important features, estimated using HSPICE with parameters for
the 1.2µm CMOS n-well process available from MOSIS.


Resource Used       Delay
RLB                 1.6ns
Function Block      additional 2.2ns
Channel Wire        2.5-3.7ns

Table  2  Estimated  space  and  memory  utilization  per  RLB  of  various  features.  (Note: 
percentage  of  RLB  area  includes  area  for  memory  cells.) 


Feature                             Percentage of RLB area    Number of bits
Vertical Segmented Channels         54%                       9
Diagonals                           6%                        2
Internal Routing & Multiplexers     23%                       7
Function Block & D-Latch            17%                       8
Total                                                         26

Table  2  describes  the  relative  sizes  of  the  main  components  of  the  Triptych  RLBs.  The 
features measured are “Vertical Segmented Channels”, including the line drivers and line
readers; “Diagonals”, which includes features similar to those of the “Vertical Segmented Channels”;
“Internal Routing & Multiplexers”, which includes the three 4:1 multiplexers for selecting the
signal to send to each output as well as the 2:1 multiplexer that chooses between the latched and
unlatched function block output; and “Function Block & D-Latch”. Note that each category not
only  includes  the  area  needed  for  the  given  functionality,  but  also  the  area  necessary  to  store  the 
configuration  bits  (which  contributes  1%  of  RLB  area  per  bit,  since  26  programming  bits  take 
up  26%  of  the  RLB  area). 

Probably the most important observation to be made from the table is that 83% of RLB
area is devoted to routing of one form or another, with the actual function calculation only
occupying 17%. Note that this number is fairly small compared to other reprogrammable
FPGAs since a full 30% of the space for “Vertical Segmented Channels” is actually the
inverters and tri-state buffers used to drive the channel wires, with another 6% in associated
memory  cells.  These  features  would  be  included  in  the  function  blocks  of  other  FPGAs. 
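
The 83% routing figure and the 26-bit total follow directly from the Table 2 entries, as the short check below shows. The dictionary layout and variable names are illustrative; the numbers are those of Table 2.

```python
# feature -> (percentage of RLB area, number of configuration bits), from Table 2
TABLE_2 = {
    "Vertical Segmented Channels": (54, 9),
    "Diagonals": (6, 2),
    "Internal Routing & Multiplexers": (23, 7),
    "Function Block & D-Latch": (17, 8),
}

LOGIC = "Function Block & D-Latch"

routing_area = sum(area for name, (area, _) in TABLE_2.items() if name != LOGIC)
logic_area = TABLE_2[LOGIC][0]
total_bits = sum(bits for _, bits in TABLE_2.values())

print(routing_area, logic_area, total_bits)
```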


DESIGN  ISSUES 

The  design  and  implementation  of  the  Triptych  FPGA  brought  up  several  issues  that  we  feel 
are  of  general  interest.  These  are  discussed  in  the  following  subsections. 

Regularity 

A  goal  in  the  design  of  the  Triptych  cell  and  interconnect  was  to  achieve  as  regular  a  structure 
as  possible.  This  was  done  because  technology  mapping  is  difficult,  and  any  irregularities  only 
complicate  the  issue.  For  example,  the  Triptych  function  block  can  compute  any  function  of 
three  inputs,  as  opposed  to  designs  such  as  the  Actel  FPGA  where  only  a  subset  of  the 
possible  functions  can  actually  be  realized  in  a  cell.  Also,  an  arbitrary  function  block  removes 
the worry of what to do for inversions, since an inverter can easily be factored into any or all of
the inputs and the output of the function block.

The interconnect scheme follows the same philosophy; the only deviation is caused by the
edges of the array. A more creative structure with the interconnect optimized differently (e.g.,
as  a  butterfly)  could  have  been  implemented,  but  we  feel  that  the  complications  added  to  the 
technology  mapping  stage  would  negate  any  potential  gains. 

Independence  of  paths  and  logical  effort 

The  Triptych  RLB  is  mostly  composed  of  multiplexers  and  bus  drivers.  Early  on,  the  decision 
had  to  be  made  whether  to  implement  most  of  the  multiplexers  with  switch  or  gate  logic.  Our 
original choice was to do most of the RLB in switch logic and only insert inverters where
necessary to drive loads. We have since decided this was a mistake and have redesigned the
circuit almost completely in gate logic. The main reason for this is something we call
“independence of paths”. The idea is that the routing of two different paths should affect the
timing of each other as little as possible. This point is much the same as the one above, except
that where the above rule dealt with the logical specification of the RLB and the interconnect,
this deals with how the RLBs are actually implemented. Take for example the case where a
single RLB output fans out to several inputs. If the RLBs were implemented in switch logic,
with pass-gates taking inputs off the vertical channels, a signal would propagate more slowly if
several  RLBs  were  reading  the  same  interconnect  line  than  if  each  had  its  own.  Thus,  a 
technology  mapper  designed  to  optimize  for  speed  would  have  to  make  sure  that  the  critical 
path  always  used  its  own  interconnect  line.  There  are  several  other  places  where  this  effect  can 
manifest  itself,  such  as  routing  an  input  to  an  output  (which  slows  down  the  function 
calculation)  and  splitting  a  signal  to  two  or  more  outputs  (which  slows  down  both  signals). 
This  rule  exists  not  just  to  make  technology  mapping  easier:  by  making  paths  independent,  it  is 
also  much  easier  to  optimize  the  RLB  channel  wire  drivers. 

The Triptych chip was originally laid out by a handful of graduate students with little or no
previous integrated circuit design experience. The project was carried on by one of these
graduate students (Scott Hauck), who did a completely new layout aided significantly by the
model of logical effort (Sutherland and Sproull 1991), which assists in the proper sizing of
transistors and insertion of buffers to optimize speed. Although we have no firm numbers
determining how much better the second design is than the first, we feel that logical effort can
help  novice  designers  develop  faster  circuits. 

Routing  flexibility 

There  are  several  unsettled  issues  in  the  design  of  the  Triptych  routing  network.  First  and 
foremost is the sharing of tracks in the vertical segmented channels. By sharing tracks between
RLBs  flowing  in  opposite  directions,  we  could  implement  a  more  flexible  feedback  capability 
than  is  possible  using  only  the  diagonals.  Currently,  the  array  has  seven  tracks  for  each 
direction,  for  a  total  of  14  in  each  segmented  channel.  One  alternative  is  to  have  5  tracks  for 
each  direction  with  another  2  shared  for  a  total  of  12.  It  is  difficult  to  tell  just  how  much 
sharing  is  needed.  The  shared  tracks  would  have  more  drivers  and  receivers  than  they  would  if 
they  were  not  shared  and  thus  be  slower.  More  experience  with  manual  and  automatic 
mapping  will  be  needed  before  this  issue  is  resolved. 

Another issue relates to the D-latch loopback capability, which replaces the channel wire
input in RLBs that use the loopback. Most likely, this input will be needed for an input and
conflict with the use of the loopback. The loopback exists because it was extremely cheap to
include. The alternative is to route the output of the D-latch around through the RLB above or
below  on  diagonals.  Whether  this  is  sufficient  or  a  form  of  internal  loopback  is  required 
(possibly  coming  in  on  diagonal  inputs)  will  also  be  determined  by  experimentation. 

Finally,  we  still  must  resolve  how  the  Triptych  array  will  be  connected  to  the  chip’s  I/O 
pins.  Inputs  can  reach  the  array  by  entering  RLBs  on  either  vertical  edge  or  by  entering  the 
vertical channels from the top and bottom. We expect to provide general input/output pads on
all four sides with routing channels along the top and bottom of the array. Connections with
either vertical edge will be more direct to provide a fast path into and out of the array for data-
path applications.

Programming  hazards 

In FPGA design it is very tempting to ignore the programming phase, except to demand the
most compact possible implementation of the programming bits. However, this can lead to
some  serious  problems.  In  an  FPGA,  there  are  certain  assumptions  made  about  the 
programming  that  are  not  enforced  by  the  hardware.  For  example,  it  is  assumed  that  at  most 
one RLB drives any specific channel wire. This is automatically enforced in intra-cell routing
and diagonals by virtue of multiplexer logic. In the case of channel wires, special hardware is
required  to  enforce  this  constraint,  with  a  very  high  overhead.  In  the  Triptych  FPGA,  we 
simply  assume  that  the  software  performing  the  mapping  deals  with  this  problem  and  that  no 
configuration  loaded  will  violate  this  constraint.  During  programming,  however,  bits  stream 
through  the  programming  memory,  violating  this  programming  assumption.  This  leads  to 
short-circuits in the chip and possible damage. One solution is to adopt a bit-addressable
scheme  for  the  programming  memory  rather  than  a  scan-path,  but  this  is  quite  expensive  due  to 
the  extra  routing  and  decode  logic  required.  Instead,  we  use  the  same  signal  that  enables  the 
scan-path  to  disable  all  channel  wire  drivers.  Thus,  while  the  chip  is  being  programmed  no 
drivers  are  active,  thereby  eliminating  the  problem.  This  costs  approximately  an  extra  3%  in 
chip area for the transistors and wiring required.
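
The single-driver assumption is the kind of check that mapping software can perform before loading a configuration. The sketch below is hypothetical: the names `contended_wires` and `active_drivers`, and the representation of a configuration as (RLB, wire) pairs, are assumptions for illustration and do not describe the actual Triptych tools or bit format.

```python
from collections import defaultdict


def contended_wires(enabled_drivers):
    """Return {wire: [rlbs]} for every channel wire driven by more than one RLB.

    `enabled_drivers` is an iterable of (rlb, wire) pairs whose driver-enable bit is set.
    """
    by_wire = defaultdict(list)
    for rlb, wire in enabled_drivers:
        by_wire[wire].append(rlb)
    return {wire: rlbs for wire, rlbs in by_wire.items() if len(rlbs) > 1}


def active_drivers(enabled_drivers, programming):
    """Model of the hardware fix: the scan-enable signal disables every channel driver."""
    return [] if programming else list(enabled_drivers)


# A transient bit pattern during scan-in that would short wire w0 if drivers were live.
transient = [("A", "w0"), ("B", "w0"), ("C", "w1")]
print(contended_wires(transient))
print(contended_wires(active_drivers(transient, programming=True)))
```

The first call reports the conflict on `w0`; the second shows that gating the drivers with the programming signal makes any transient pattern harmless.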


USING  TRIPTYCH 

In  this  section,  we  present  several  circuits  that  we  have  mapped  by  hand  to  Triptych.  The 
purpose  of  these  examples  is  to  demonstrate  the  constraints  on  routing  and  how  multilevel  logic 
circuits  do  indeed  map  to  the  physical  structure  provided  by  Triptych.  In  these  examples,  each 
RLB is shown as a cell with three input entries and three output entries. Each entry indicates
an incoming or outgoing signal. Note that each block may create a new signal by computing a
logic  function  over  the  inputs.  Diagonals  and  reverse  diagonals  that  are  used  in  the 
implementation  are  highlighted,  as  are  connections  to  the  channel  wires.  For  clarity,  only  those 
vertical  wires  carrying  signals  are  shown. 

8-bit  rotate  function 

The power of using columns of RLBs for routing only is shown in this example, which rotates
a set of 8 bits 4 positions. Each level can be used to send one signal from each RLB to a
neighbor of the final position. Since each RLB has two outputs, one intermediate RLB column
and two vertical channels are required to route the signals to their final destination (see Figure
7). This generalizes to the case where three signals are routed per RLB, which requires two
intermediate RLB columns and three channels.


Figure  7  Triptych  mapping  of  a  4-bit  rotate  of  8  bits. 


Generalizing  this  use  of  the  vertical  channels  suggests  a  naive  place  and  route  algorithm  that 
alternates columns of RLBs used for routing with columns used to compute logic functions.
Subject  to  a  sufficient  number  of  routing  tracks,  this  leads  to  a  viable  routing  of  arbitrary  logic 
functions.  However,  as  the  next  example  shows,  this  scheme  is  much  less  area-efficient  than 
is  generally  achievable. 

State  machine  example 

Figure  8  shows  the  factored  logic  equations  and  corresponding  Triptych  implementation  for  the 
ubiquitous  traffic  light  example.  This  example  shows  that  circuit  mappings  can  be  very 
compact  if  the  individual  logic  blocks  are  correctly  placed.  The  inputs  and  outputs  of  this 
circuit  are  all  connected  at  the  left  and  right  of  the  array,  except  for  three  signals  that  use  the  pin 
input track of the vertical channels (shown dangling off the bottom). In this example 16 RLBs
are  used  to  compute  logic  functions,  2  RLBs  are  used  only  for  routing,  and  6  RLBs  are  left 
unused  (these  6  RLBs  must  be  counted  in  order  to  achieve  a  rectangular  mapping;  they  might 
be  used  in  neighboring  circuits).  Also,  this  circuit  is  assumed  to  be  placed  along  the  left  edge 
of the chip, so the vertical tracks at that edge are used to connect RLBs in the same column.
Note  that  this  example  would  have  been  easier  to  map  if  the  vertical  wires  could  be  used  to 
route  within  a  column  anywhere  in  the  chip,  not  just  at  the  borders,  and  in  fact  such  an 
extension  is  under  consideration.  This  is  about  as  compact  a  Triptych  layout  as  can  be 
expected  for  a  random  logic  function. 




INORDER = s1 s2 d1 st SB0 SB1;
OUTORDER = NSB0 NSB1 r1 y1 g1 r2 y2 g2 sd;

!SB0
NSB1 = !st * !g1 * !g2;
y1 = r2 * 51;
g1 = r2 * !51;
r2 = !st * SB0 * !9 + !st *
y2 = 53 + 45;
g2 = !st * !r2 * !y2;
sd = 12 + 45;
51 = s2 * !SB1 + !SB0 * SB1;
9 = !SB1 + !d1;
45 = s1 * !SB1 * 46;
52 = !d1 * SB1;
53 = 52 * 46;
12 = !SB1 * 18;
NSB0 = !st * !r2;
46 = !st * SB0;
18 = !SB0 * s2 * !st;


Figure  8  Factored  equations  and  Triptych  realization  of  the  traffic  light  controller. 


Lyon bit-serial multiplier

Although  our  experience  shows  that  Triptych  can  be  used  to  implement  a  wide  range  of 
circuits, its locally connected structure makes it especially good for repetitive arrays like bit-
serial arithmetic circuits. The Triptych structure has some of the same features (e.g., nearest
neighbor connections) as the Labyrinth FPGA, which was targeted to bit-serial and
pipelined/systolic circuits. We have chosen the Lyon bit-serial multiplier cell (Lyon 1981)
shown  in  Figure  9  as  a  representative  circuit  from  this  class.  A  full  n-bit  multiplier  comprises  n 
copies  of  this  cell,  and  signal  processing  circuits  typically  make  use  of  several  of  these 
multipliers, containing many individual cells.

Figure 9 Design of a single Lyon bit-serial multiplier cell.

Figure 10 Layout for the Lyon bit-serial multiplier cell.


This multiplier cell presents the same classic layout problem as that faced by VLSI cell
designers. The cells need to tile horizontally so that inputs match outputs and vertically so that
little space is wasted between adjacent multiplier cells (see Figure 10). In this case, however,
there is an extra dimension since a string of multiplier cells will wrap around the chip on the
opposite direction of RLBs. Since there is one RLB that is used from the opposite direction
(position E-2), the layout must provide a “hole” into which this RLB can fit (position B-4).
Note that these two logical RLBs can share a single physical RLB since they use independent
paths through the RLB. The cost of this multiplier cell design is 12.5 RLBs, which is not much
more than the smallest conceivable design, which costs 11 RLBs. The 0.5 RLB results from
the sharing of one RLB (positions A-3 and F-4) between two vertically adjacent multiplier
cells. 

Measurement  and  comparison 

Although  our  experience  with  mapping  circuits  to  Triptych  is  thus  far  very  limited  since 
automated  placement  and  routing  are  still  being  developed,  we  have  some  preliminary 
measurements of the cost of Triptych implementations relative to Labyrinth and Xilinx. Since
the area cost is measured for each FPGA type in terms of the number of logic blocks used for
that  technology,  we  must  first  normalize  the  cost  of  the  different  FPGA  logic  blocks  to  be  able 
to  compare  the  different  FPGAs.  Although  such  relative  figures  are  difficult  to  come  by,  we 
have  combined  a  relative  size  estimate  based  on  die  size  and  number  of  logic  blocks,  along  with 
the  relative  number  of  program  bits  to  arrive  at  the  following  relative  cost  figures.  Using  the 
cost  of  the  Labyrinth  logic  block  as  the  normalized  unit  cost,  we  estimate  that  the  cost  of  a 
Triptych RLB is about 4-5 (4.5) units and that of a Xilinx CLB (configurable logic block) is
about 20-25 (22) units. This places the Triptych logic cell squarely in the middle between the
very cheap Labyrinth cell and the relatively expensive Xilinx cell.

Table  3  gives  the  approximate  cost  of  implementing  a  number  of  circuits  using  all  three 
FPGAs, both in terms of each technology’s logic blocks and in normalized cost as defined
above.  We  believe  these  figures  indicate  that  Triptych  is  a  promising  architecture  for  a  range  of 
different  circuits.  These  results  are  of  course  very  preliminary  and  many  more  experiments 
must  be  done  with  other  circuits  and  using  automatic  place  and  route  tools. 


Circuit      Labyrinth                   Xilinx                      Triptych
             # blocks  normalized cost   # blocks  normalized cost   # blocks  normalized cost
Multiplier                               5         110               12.5      56
Traffic                                  6         132               24        108
s208         N/A       N/A               26        572               61        275

Table 3. Results of mapping three examples: the Lyon bit-serial multiplier, a traffic
light controller, and ISCAS benchmark s208 to Labyrinth, Xilinx, and Triptych.
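
The normalized costs in Table 3 are the block counts multiplied by the relative unit costs estimated above (Labyrinth 1, Triptych 4.5, Xilinx 22). The sketch below reproduces the Xilinx and Triptych columns; the function name and the round-half-up rule are assumptions, chosen because they match the published figures.

```python
import math

# Relative cost of one logic block, in Labyrinth-block units (midpoint estimates from the text).
UNIT_COST = {"Labyrinth": 1.0, "Triptych": 4.5, "Xilinx": 22.0}


def normalized_cost(fpga, blocks):
    """Block count times unit cost, rounded half up to a whole number of units."""
    return math.floor(blocks * UNIT_COST[fpga] + 0.5)


for circuit, xilinx_blocks, triptych_blocks in [
    ("Multiplier", 5, 12.5),
    ("Traffic", 6, 24),
    ("s208", 26, 61),
]:
    print(circuit,
          normalized_cost("Xilinx", xilinx_blocks),
          normalized_cost("Triptych", triptych_blocks))
```

On these three circuits the Triptych cost comes out at roughly half the Xilinx cost (56 against 110, 108 against 132, 275 against 572), with the caveat already stated in the text that the unit costs are rough estimates.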


Issues  in  mapping  to  Triptych 

We  have  successfully  mapped  a  number  of  regular  structures  and  small  control  circuits  to  the 
Triptych  architecture,  and  we  are  currently  working  on  CAD  tools  that  will  automatically 
perform the mapping for arbitrary circuits. As with other FPGAs, the process of mapping a
circuit onto Triptych can be considered to consist of three steps:

•  covering:  forming  a  circuit  graph  containing  function  nodes  with  at  most  three  inputs, 

•  placement:  assigning  these  function  nodes  to  ceil  locations  on  Triptych,  and 

•  routing:  making  the  connections  in  the  graph  through  the  available  routing  on  Triptych. 

If the circuit to be mapped has a regular structure, as is encountered in domain-specific
applications such as digital signal processing, an initial pattern for the repeating portion may be
derived  by  hand.  Circuits  without  regular  structure,  or  "random  logic”,  must  rely  on  heuristic- 
based  automatic  placement  and  routing  methods  similar  to  those  used  by  other  FPGAs. 
However,  because  Triptych’s  routing  resources  are  highly  constrained,  placement  and  routing 
must  be  more  closely  integrated  than  they  are  in  other  FPGAs. 

For  the  covering  portion  of  mapping  to  Triptych,  we  assume  that  a  tool  such  as  chortle  or 
mis-pga is available to express the original circuit as a graph of elementary gates and then cover
the graph’s fanout-free trees with collections of three-input RLBs (Francis 1991, Murgai
1990). It should be noted, however, that a covering which minimizes the total number of
RLBs may not be optimal when placement and routing are taken into consideration. For
example, if after placement two of the inputs to a three-input RLB naturally both occur at a
single location distant from that RLB, it is usually advantageous to split the RLB into two two-
input functions. If this is possible, we can route one less signal across the large distance.
Clearly, such situations are not unique to Triptych. However, we particularly wish to avoid
routing  extra  signals  horizontally  whenever  it  can  be  avoided.  Otherwise,  RLBs  become 
congested  with  signals  they  do  not  use.  Such  optimizations  are  difficult  to  predict  at  cover  time 
and  thus  need  to  be  attempted  during  routing. 

Because  Triptych’s  routing  resources  are  limited  and  fairly  tightly  constrained,  we  believe  it 
is  necessary  to  keep  placement  and  routing  well  integrated.  Evaluating  possible  placements 
with simple measures of routing length can lead to placements whose congestion makes routing
nearly  impossible.  Currently,  we  are  exploring  iterative  improvement  methods  for  placement 
which  will  assign  an  RLB  only  into  locations  which  are  adjacent  to  enough  free  tracks  to  route 
the  RLB’s  inputs  and  outputs.  Thus,  we  avoid  congestion  at  a  local  level  whenever  we  place 
an  RLB. 
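
The local congestion test described above can be sketched as follows. This is only an illustration under simplifying assumptions that are not from the paper: each cell location is summarized by a single count of free adjacent tracks, and an RLB needs one track per input and per output. The names `Rlb`, `can_place`, and `legal_locations` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of an RLB's routing demand.
struct Rlb {
    int inputs;   // number of input signals to route
    int outputs;  // number of output signals to route
};

// A location is acceptable only if enough free tracks are adjacent to it
// to carry every input and output of the RLB (assumed: one track each).
bool can_place(const Rlb& rlb, int free_tracks_at_location) {
    assert(rlb.inputs >= 0 && rlb.outputs >= 0);
    return free_tracks_at_location >= rlb.inputs + rlb.outputs;
}

// Indices of all locations that pass the local congestion test.
// free_tracks[i] is the number of unused tracks adjacent to location i.
std::vector<std::size_t> legal_locations(const Rlb& rlb,
                                         const std::vector<int>& free_tracks) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < free_tracks.size(); ++i) {
        if (can_place(rlb, free_tracks[i])) result.push_back(i);
    }
    return result;
}
```

An iterative improvement loop would then restrict candidate moves to the returned locations, so that no single placement step creates congestion it cannot route locally.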

A  complicating  factor  is  that  Triptych’s  distance  metric  is  non-symmetric.  All  pairs  of 
RLBs that face in the same direction, except those in the same column, have a distance from the
first’s output to the second’s input different than that of the second’s output to the first’s input.
Also,  vertically  adjacent  blocks  have  the  same  routing  distance  as  diagonally  adjacent  blocks. 
For  these  reasons,  routing  distance  is  not  well  represented  by  the  x-y  coordinates  given  to  the 
RLBs.  Work  is  ongoing  to  develop  an  integrated  force-directed  placement  procedure,  a 
Triptych-specific  distance  measure,  and  the  congestion  avoiding  method  mentioned  above. 


CONCLUSIONS 

The  new  FPGA  architecture  presented  in  this  paper  was  motivated  by  three  needs:  permitting 
the realization of delay-critical circuits; including data-path and control elements in the same
array; and minimizing the space devoted to routing resources. We believe that Triptych
achieves these goals given the experience gained so far with many example circuits, a few of
which  have  been  presented  above.  The  examples  have  proven  to  be  more  densely  packed  and 
to  have  delay  characteristics  comparable  to  the  other  FPGAs. 

The  most  interesting  and  challenging  future  direction  for  research  is  automatic  mapping. 
Triptych  requires  that  the  functional  and  interconnect  elements  not  be  treated  separately. 
Combining  the  considerations  for  covering,  placement,  and  routing  should  allow  us  to  develop 
mapping  tools  that  more  precisely  predict  the  performance  of  circuits  and  more  accurately  trade 
off  density  for  speed. 

Pedagogically,  the  design  of  a  field-programmable  gate  array  made  an  excellent  class 
project. It exposed our students to a large vertical slice of the design problem stretching from
electrical  details  to  technology  mapping  issues.  They  were  able  to  experience  many  of  the 
issues  and  tradeoffs  that  must  be  resolved  in  both  integrated  circuit  and  logic  design. 
Furthermore, the design and layout were ideal for a class project because the work was easily
partitioned and only a small number of cells needed to be considered. In this respect, the
Triptych effort was a resounding success and has motivated several of the non-VLSI, non-CAD
oriented students to continue to look into VLSI issues.

In  summary,  we  learned  much  from  the  design  experience  and  believe  we  have  a  viable 
new  FPGA  architecture  for  circuits  where  either  minimization  of  delay  is  of  critical  importance 
and/or data-path elements must be included with control logic. There is much work to be done,
especially in the area of automatic mapping, and promising directions are just beginning to be
pursued. 
Abstract. Digital circuit behavior consists of two
components:  functionality  and  timing.  Most  computer-aided 
design  tools  focus  on  functional  aspects  of  behavior 
emphasizing  data  transformation  and  sequencing  of  operations. 
Timing  aspects  of  behavior  have  received  far  less  attention  and 
have  only  recently  come  to  the  fore  as  concerns  for  system 
level  simulation,  system  integration,  and  synthesis  are 
becoming  more  acute.  In  this  paper,  we  present  a  behavior 
simulator, called OEsim, that addresses these issues. The
underlying  model  is  based  on  OEgraphs  and  supports  the 
specification  of  timing  constraints  at  many  levels  of 
abstraction,  from  propagation  delays  to  interface  behavior. 
The  simulator  is  an  ideal  design  validation  tool  in  that  it 
supports incremental checking of timing constraints during
simulation  and  minimizes  the  description  of  circuit  details 
unnecessary  to  timing  simulation.  The  key  ideas  are  the  use  of 
causality  for  the  specification  of  abstract  constraints  through 
the  use  of  a  restricted  first-order  predicate  calculus  that  has  a 
clear  simulation  semantics  and  an  efficient  realization  in  the 
simulator.  OEsim  has  been  implemented  in  C++  and  constructs 
a  compiled  simulation  from  an  OEgraph  representation. 

I. Introduction

Design representations capture many facets of digital
circuit  specifications.  Circuit  behavior  is  the  high-level 
description  of  what  a  circuit  does  without  overly  specifying 
how  that  computation  is  performed.  Circuit  structure  is  the 
low-level  description  of  how  the  computation  is  implemented, 
that is, the logic gates and flip-flops used. Functional aspects
of  both  behavior  and  structure  describe  the  data  transformations 
and  computations  to  be  performed  on  the  inputs  in  order  to 
generate  the  outputs.  In  contrast,  timing  relationships  for 
behavior  and  structure  describe  temporal  properties  such  as 
minimum  and  maximum  separation  times  for  signal  events. 
Figure  1  maps  out  the  space  of  design  representation 
schematically. 

It  is  of  critical  importance  that  a  design  representation 
support user validation. Making sure that what was specified is
what  was  desired  is  the  first  step  in  verifying  a  design  and 
cannot  be  automated.  A  simulator  provides  the  user  with  the 
capability  to  try  out  the  circuit  and  make  sure  it  behaves  as 
expected  (at  least  for  a  subset  of  all  possible  inputs).  The 
ability  to  simulate  the  specification  at  any  point  in  the  design 
space  is  also  crucial  as  designs  may  consist  of  both  behavioral 
and  structural  components.  Existing  simulators  focus 
primarily on functional or structural aspects and include little
support  for  the  simulation  of  timing  behavior. 
Figure 1. The design representation space is divided between
behavior and structure and orthogonally between functional and
timing aspects. Examples in each of the four quadrants
demonstrate the distinctions.


Detailed timing behavior is an important part of high-level
specification, simulation and synthesis. When
synthesizing a circuit not all aspects of its behavior are under
the control of the designer. That is, the circuit must conform to
the environment in which it will be placed. The environment
may demand that particular timing relationships be respected.
These can be as simple as setup and hold times or as complex as
the spacing between messages sent over a local area network.
Within a circuit, we must be able to adequately describe the
timing methodologies used to implement different parts of the
circuit. These include precharging constraints, pipeline
interlocks, and multiple-phase clocks. Also, we must be able
to provide an abstraction of a circuit based on its interface
behavior, a crucial capability for information hiding and
modularity.

We have developed a new representation which supports
the specification of timing behavior and was designed with
simulation and incremental synthesis in mind. A clear
simulation semantics was a requirement for all the features of
the model enabling, among other capabilities, incremental
timing constraint checking during simulation. We have
implemented a simulator, based on our representation, which
provides empirical verification of timing behavior.

This paper is divided into five major sections. Section II
motivates our approach to timing specification and provides
the details of our new model. Section III describes our
simulator and how simulation efficiency concerns affected the
development of the representation. Section IV contains a large
example which demonstrates the usefulness of the
representation and the simulator. Section V contains a
comparison of the simulator with existing high-level
simulators and concludes the paper by summarizing the
contributions.
II.  A  New  Model  for  Timing  Specification

To describe timing relationships between circuit events we
need to identify the circuit events being constrained and
express properties which the events need to satisfy. For
example, a setup constraint applies to two circuit events (e.g.,
an event on an input and the next rising clock edge), and
requires that they be separated by a fixed amount of time (e.g.,
[time of edge - time of input] > 5 ns). Circuit events are usually
constrained by specifying separation times, both absolute
("5 ns") and relative ("3 cycles"), to characterize propagation
delays,  clock  rates,  etc.  At  higher  behavioral  levels,  timing 
constraints  are  more  abstract  because  they  specify  separation 
times  between  abstract  behavioral  events.  Constraints  may 
apply only in a specific behavioral context (e.g., during read
and  not  write  operations)  or  as  a  result  of  a  specific  causal  path. 
Typically  timing  constraints  are  informally  specified  using 
timing diagrams and tables (see Figure 2).




Figure  2.  A  timing  diagram  from  the  specification  of  the 
Intel Multibus [1]. Constraints specify separation times
between  events  (i.e.,  changes  in  logic  level  on  signal  wires). 


To  identify  circuit  events  both  chronological  and  causal 
relationships  are  needed.  Chronological  relationships  rely  on 
time  as  a  means  of  identifying  the  events  being  constrained 
(e.g., the next clock edge) and appear often in many
representations (e.g., temporal logics [2,3], hardware
description languages [4], waveform algebra [5], and many
others). Causal relationships are equally important. For
example, in Figure 3, requests to a non-FIFO queue need to be
acknowledged in less than 100ms. However, in this case,
chronology cannot be used to pair up and identify the request
and  acknowledge  events  as  they  may  occur  in  a  different  order 
on  the  output  than  on  the  input  (i.e.,  a  later  request  may  be 
acknowledged  before  an  earlier  one). 
Event nodes represent changes in logic level on circuit
wires.  If  an  event  node  is  on  a  cycle  in  the  graph  then  it  may 
occur  multiple  times.  Therefore,  it  is  important  to  distinguish 
a  discrete  event  from  an  event  node.  An  event  node  represents 
an  event  that  may  occur;  a  discrete  event  is  an  actual  occurrence 
of that event. Operation nodes or boxes are the second type of
node  in  our  graph  representation  and  correspond  to  the 
functional  aspects  of  behavior  and  structure.  Boxes  contain 
program code (e.g., C++ source code) that is evaluated
whenever an input event occurs. The evaluation may
conditionally generate output events which will occur at some
future  point  in  time  and/or  possibly  change  internal  state.  An 
example  of  an  operation  node  is  a  logic  gate  that  may  generate 
an  output  event  whenever  an  input  event  occurs.  The  delay  in 
generating  the  output  event  corresponds  to  the  propagation 
delay of the gate. The delay is specified by either a fixed value
or  a  range  of  values  and  a  distribution  function  to  indicate 
uncertainty  with  respect  to  when  the  event  will  actually  occur 
(e.g.,  the  delay  is  dynamically  computed  each  time  a  discrete 
event  is  generated).  A  more  abstract  example  of  a  box  is  one 
that  forks  two  independent  processes:  an  input  event  arrives  at 
an  operation  node  that  will  then  cause  two  parallel  output 
events  thereby  permitting  two  parts  of  the  specification  to 
proceed  in  parallel.  The  events,  in  this  case,  do  not  correspond 
to  logic  transitions  and  instead  represent  abstract  control  flow. 

Dependency arcs specify the flow of events into and out of
boxes.  The  graph  is  bipartite  because  dependency  arcs  specify 
flow  of  control  by  directing  events  into  boxes  and  the  output  of 
boxes to events. Events have an in-degree of at most one
dependency arc; thus each event is either an external event or is
caused by the execution of a single operation node. Events
may  have  arbitrary  out-degree. 
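
The structural rules stated above (arcs only join an event node to a box or a box to an event node, and each event node has at most one incoming dependency arc) can be checked mechanically. The following is a minimal sketch, not OEsim's actual data structure; the type `OEGraph` and its members are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class Kind { Event, Box };

struct Arc {
    std::string from;
    std::string to;
};

// Hypothetical container for an OEgraph-like bipartite graph.
struct OEGraph {
    std::map<std::string, Kind> nodes;
    std::vector<Arc> arcs;

    void add_node(const std::string& name, Kind kind) { nodes[name] = kind; }

    // Adds a dependency arc; both endpoints must already exist.
    void connect(const std::string& from, const std::string& to) {
        assert(nodes.count(from) == 1 && nodes.count(to) == 1);
        arcs.push_back({from, to});
    }

    // True if every arc joins an event to a box (or a box to an event)
    // and no event node has more than one incoming arc.
    bool well_formed() const {
        std::map<std::string, int> event_in_degree;
        for (const Arc& a : arcs) {
            const Kind from_kind = nodes.at(a.from);
            const Kind to_kind = nodes.at(a.to);
            if (from_kind == to_kind) return false;  // violates bipartiteness
            if (to_kind == Kind::Event && ++event_in_degree[a.to] > 1) {
                return false;  // event caused by more than one box
            }
        }
        return true;
    }
};
```

Building the single-phase clock of Figure 4 (events F and R, boxes op1 and op2, four arcs) passes this check, while adding a second arc into F would fail it.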


[Figure 4, graphical version. Labels: Dependency Arc; Operation Node (Box); Event Node; R (ck+).]


oe_wire ck("ck");
oe_event F("F", ck, LOW);
oe_event R("R", ck, HIGH);

fall() {
    if (trigger == R) cause(F, 25); }
rise() {
    if (trigger == F) cause(R, 25); }
main() {
    oe_box op1("op1", fall);
    oe_box op2("op2", rise);
    connect(op1, F); connect(R, op1);
    connect(F, op2); connect(op2, R); }


Figure  3.  A  non-FIFO  queue  that  acknowledges  requests  in  a 
different  order  than  the  requests  were  received. 


II.1  The OEgraph Model


Figure  4.  Graphical  and  textual  versions  of  a  simple  single 
phase clock in the OEgraph representation (from a behavioral
perspective).  The  clock  has  a  cycle  time  of  50  time  units  with 
a  50%  duty-cycle. 


The  representation  we  have  proposed  [6,7]  is  a 
straightforward  graph  model  whose  basis  consists  of  two  types 
of  nodes  connected  by  directed  arcs  to  form  a  bipartite  graph.  A 
restricted  predicate  calculus  is  used  to  express  timing 
constraints.  The  model  contains  a  novel  concept:  event 
ancestry,  to  permit  reasoning  about  causal  as  well  as 
chronological  relationships.  We  present  a  summary  of  the 
model  omitting  many  of  the  features  that  are  not  directly  related 
to  the  expressibility  of  timing  behavior  (the  applicability  of 
the model to the entire design space is discussed in detail in
[7]).


Event  nodes  that  affect  changes  on  the  same  wire  can  be 
grouped  into  an  event  node  set.  Since  every  possible  change 
on  the  wire  (every  discrete  event)  is  collected  into  the  set  we 
refer  to  such  a  set  as  a  wire.  In  practice,  wires  can  be 
considered  a  third  type  of  node  in  our  graph.  Operations  may 
have  wires  as  inputs  (any  event  that  is  a  member  of  the  wire  is 
an  input  to  the  operation)  and  may  also  have  wires  as  outputs 
(the  output  wire's  value  is  changed  via  an  implicit  internal 
event).  An  operation  can  ask  about  the  value  of  an  input  wire 
(i.e.,  the  effect  of  the  most  recent  discrete  event  on  that  wire). 
The model contains additional elements for handling buses and
other  structural  constructs. 


oe_wire ck("ck");
oe_wire D("D");
oe_wire Q("Q");
latch() {
    if (trigger == ck && ck == HIGH)
        cause(Q, uniform_delay(5,10),
              valueOf(D)); }

main() {
    oe_box op1("op1", latch);
    connect(ck, op1);
    connect(D, op1);
    connect(op1, Q); }

Figure 5. A clocked edge-triggered D-latch in our
representation. Note that the propagation delay is uniformly
distributed  between  3  and  10  time  units. 


The  model  uses  an  event  based  paradigm  because  timing 
constraints  are  relationships  between  circuit  events  and  are 
more  easily  expressed  in  an  event  based  model  [8].  With 
respect  to  our  representation,  timing  constraints  are 
relationships  between  discrete  events  —  not  event  nodes.  For 
example,  the  setup  time  constraint  on  the  D-input  to  the  latch 
of Figure 5 applies to any event on D and the chronologically
next rising-edge event on the clock.
Chronological  relationships  can  often  be  used  to  identify  the 
discrete  events  involved  in  a  constraint.  However,  constraints 
may  be  relative  to  a  particular  execution  path  in  a  complex 
graph,  and  constraint  specification  must  include  a  way  of 
getting  at  this  history.  Causal  relationships  also  need  to  be 
described  and  reasoned  about.  An  example  of  a  constraint 
requiring  a  causal  relationship  is  a  response  time  constraint  on 
requests  to  the  queue  of  Figure  3.  Causality  pairs  up 
corresponding  request  and  acknowledge  events  so  that  a 
maximum  separation  can  be  specified. 

We  have  developed  the  concept  of  event  ancestry  to 
address  the  specification  of  causal  relationships.  An  ancestor 
of  a  discrete  event  is  any  previously  occurring  discrete  event 
that  led  to  the  generation  of  its  descendant  through  its  effect  on 
a  series  of  boxes.  Thus,  every  discrete  event  has  an  ancestry 
tree,  consisting  of  its  immediate  ancestors  and  their  ancestors, 
transitively.  Whenever  an  operation  decides  to  generate  an 
event,  the  new  discrete  event  has  as  ancestors  the  most  recent 
discrete  occurrences  of  the  input  events  named  as  ancestors  (all 
inputs  by  default).  When  an  input  wire  is  named  as  an  ancestor, 
the  most  recent  discrete  event  on  the  wire  (as  seen  by  the 
operation)  is  the  ancestor.  Internal  state  is  treated  the  same 
way as output events in that internal state variables have
ancestors  and  can  be  named  as  ancestors  of  output  events. 
When  a  state  variable  is  named  as  an  ancestor  all  of  the 
ancestor  events  of  the  state  variable  are  ancestors  of  the 
discrete  event.  This  permits  decomposition  of  operations. 
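
Event ancestry as described above amounts to a directed acyclic graph over discrete events, with the "anc" test being reachability along ancestor links. The sketch below illustrates that idea only; the types `DiscreteEvent` and `History` and the functions `record` and `anc` are hypothetical names, and OEsim's real bookkeeping (including state-variable ancestors) is not shown in the text.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One actual occurrence of an event node.
struct DiscreteEvent {
    int node;                  // id of the event node it is an occurrence of
    long time;                 // integer occurrence time
    std::vector<int> parents;  // indices of its immediate ancestors
};

using History = std::vector<DiscreteEvent>;

// Records a new discrete event and returns its index. Ancestors must have
// been recorded earlier, which keeps the ancestry graph acyclic.
int record(History& h, int node, long time, std::vector<int> parents) {
    for (int p : parents) assert(p >= 0 && p < static_cast<int>(h.size()));
    h.push_back({node, time, std::move(parents)});
    return static_cast<int>(h.size()) - 1;
}

// anc(x, y): true if discrete event y is an ancestor of discrete event x,
// directly or transitively.
bool anc(const History& h, int x, int y) {
    std::vector<int> pending(h[x].parents);
    while (!pending.empty()) {
        const int e = pending.back();
        pending.pop_back();
        if (e == y) return true;
        pending.insert(pending.end(), h[e].parents.begin(), h[e].parents.end());
    }
    return false;
}
```

With the non-FIFO queue of Figure 3, two requests followed by acknowledgments in the opposite order can still be paired correctly, because each acknowledgment names its own request as an ancestor regardless of chronology.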

The  language  used  for  the  specification  of  timing 
constraints (which are assertions about the desired behavior of
the  specification)  is  based  on  a  first-order  predicate  calculus 
that  consists  of: 

•  standard Boolean and integer functions and relations
($\lor$, $\land$, $\lnot$, $+$, $*$, $/$, $<$, $>$, $=$)

•  quantification of discrete events and a test for equality
(existential "$\exists$", universal "$\forall$" and equality "$=$")

•  a relation "$\in$" to test if a discrete event is an occurrence of
a named event or a member of an event set

•  an integer function "$\tau$" that returns the time at which a
discrete event occurs

•  a relation "anc" to test whether a discrete event is an
ancestor of another

•  a function "v" returning the wire value for a discrete event

•  associated wire value relations

The time at which an event occurs is not an infinite
precision real, but instead is an integer. Thus multiple events
may  appear  to  occur  at  the  same  time,  due  to  a  natural 
granularity  problem  that  also  exists  in  the  physical  world  (i.e., 
at  some  level  of  detail  it  is  not  possible  to  determine  which  of 
two  events  actually  occurred  first).  This  affects  the  definitions 
of  the  primitive  relations  described  later  in  this  section. 

The  full  first-order  predicate  calculus  introduces  problems 
(discussed  in  the  next  section)  which  we  have  addressed  by 
restricting  the  representation  of  timing  constraints  to  the 
following  format:  1.  universal  quantification  of  the  discrete 
events involved in the constraint, 2. specification of the
context  within  which  the  constraint  must  hold  and,  3. 
specification  of  a  particular  timing  relationship  that  is  required 
to  be  true  when  the  context  is  true. 

[quantified discrete events in a context] $\Rightarrow$ [timing requirement for the discrete events]

In  order  to  capture  much  of  the  expressibility  of  the  full 
calculus,  while  restricting  it  to  the  format  above,  we  added  the 
following  three  relations  (lower  case  variables  represent 
quantified  discrete  events). 

•  mra(x,S,y) to test whether x's most recent S ancestor
(of all the events in the set S) is/might be y.

$\mathrm{mra}(x,S,y) \equiv \forall z\,(\mathrm{anc}(x,y) \land (z \notin S \lor \lnot\mathrm{anc}(x,z) \lor \tau(z) \le \tau(y)))$

•  pco(x,Y,y) to test whether x's previous chronological
occurrence of an event in the set Y is/might be y.

$\mathrm{pco}(x,Y,y) \equiv \forall z\,(y \ne x \land \tau(y) \le \tau(x) \land (z \notin Y \lor \tau(z) \ge \tau(x) \lor \tau(z) \le \tau(y)))$

•  nco(x,Y,y) to test whether x's next chronological
occurrence of an event in the set Y is/might be y.

$\mathrm{nco}(x,Y,y) \equiv \forall z\,(y \ne x \land \tau(y) \ge \tau(x) \land (z \notin Y \lor \tau(z) \le \tau(x) \lor \tau(z) \ge \tau(y)))$

These  three  relations  are  formally  defined  using  the  full  calculus 
and  encapsulate,  in  a  more  efficient  and  compact  form, 
concepts which are essential for timing constraint expression.
All constraints consist of combinations of these relations and
the  logic  primitives. 
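
The two chronological relations can be evaluated directly over the set of discrete events seen so far. The sketch below reads the definitions as: y differs from x, y occurs no later (pco) or no earlier (nco) than x, and no occurrence of an event in Y falls strictly between them. It is an illustration rather than OEsim's implementation; `Ev`, `NodeSet`, and the function signatures are assumptions, and mra is omitted because it additionally needs the ancestry graph.

```cpp
#include <cassert>
#include <set>
#include <vector>

struct Ev {
    int id;    // unique identifier of the discrete event
    int node;  // event node this is an occurrence of
    long t;    // integer occurrence time
};

using NodeSet = std::set<int>;  // an event set Y, given as event node ids

// pco(x, Y, y): y might be x's previous chronological occurrence of Y.
bool pco(const std::vector<Ev>& all, const Ev& x, const NodeSet& Y, const Ev& y) {
    assert(!Y.empty());
    if (y.id == x.id || y.t > x.t) return false;
    for (const Ev& z : all) {
        // A Y event strictly between y and x disqualifies y.
        if (Y.count(z.node) && z.t > y.t && z.t < x.t) return false;
    }
    return true;
}

// nco(x, Y, y): y might be x's next chronological occurrence of Y.
bool nco(const std::vector<Ev>& all, const Ev& x, const NodeSet& Y, const Ev& y) {
    assert(!Y.empty());
    if (y.id == x.id || y.t < x.t) return false;
    for (const Ev& z : all) {
        // A Y event strictly between x and y disqualifies y.
        if (Y.count(z.node) && z.t > x.t && z.t < y.t) return false;
    }
    return true;
}
```

Because times are integers and ties are allowed, more than one y can satisfy the relation for a given x, which is why the text says "is/might be". As in the printed definitions, these functions do not themselves test that y belongs to Y; a constraint's quantification is expected to supply that.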

II.2  Examples

Figure  6  contains  the  specification  of  the  setup  constraint 
for the D-input to the latch of Figure 5. The textual
representation for the calculus is very straightforward.
Constraints (and discrete events) are declared and specified
using the three part format described above. Constraints can be
specified individually as shown in Figure 6 or through the use
of subroutines that can be parameterized (as is the case with
common constraints such as setup and hold).

Most constraints imposed on a circuit have a simple
semantics  and  are  thus  easily  expressed.  For  example  a 
constraint  specifying  that  an  input  waveform  has  a  minimum 
pulse  width  is  easily  expressed:  quantify  two  edges  of  the 
waveform,  and  if  they  are  chronologically  related  (e.g., 
pco(edge1, waveform, edge2)) then they must be separated by
the minimum pulse width (e.g., $\tau(\mathit{edge1}) - \tau(\mathit{edge2}) \ge \mathit{minpulse}$).




setup (D to R) > 10

... inserted into main() of Figure 5
oe_constraint setup("setup");
discrete_name D0("D0", D);
discrete_name R0("R0", R);
setup.quantify(D0, R0);
setup.context(nco(D0, R, R0));
setup.require(timeOf(R0) - timeOf(D0) > 10);

restricted calculus:
$\forall r\,\forall d\,((r \in R \land d \in D \land \mathrm{nco}(d,R,r)) \rightarrow (\tau(r) - \tau(d) > 10))$

full calculus (equivalent):
$\forall r\,\forall d\,\exists z\,(\tau(r) - \tau(d) > 10 \lor r \notin R \lor d \notin D \lor r = d \lor \tau(r) < \tau(d) \lor (z \in R \land \tau(d) < \tau(z) < \tau(r)))$


Figure  6.  Our  representation  of  the  setup  constraint  for  the 
D  input  to  the  latch  in  Figure  5. 


Many  constraints  which  appear  to  be  simple  in  nature 
actually  have  a  complicated  semantic  meaning.  For  example  a 
constraint  stating  that  “two  events  occur  one  cycle  apart"  is 
subject  to  many  interpretations.  Timing  diagrams  and  tables 
attempt  to  convey  the  semantics  of  such  constraints  but  are 
informal  specifications  and  are  often  insufficient.  Using 
OEgraphs  and  the  constraint  specification  language  described 
above we can formally specify such a constraint.

Figure 7 contains a constraint that states that events A and
B  are  required  to  occur  exactly  one  cycle  apart  whenever  A  and  B 
occur  from  the  same  request  event  (REQ).  The  constraint  is 
complicated  by  the  fact  that  A  and  B  occur  synchronously  to 
the  falling  edge  of  the  clock  (F)  and  may  occur  anywhere  within 
the  shaded  region  shown  in  Figure  7  (requiring  an  exact 
separation  time  between  A  and  B  would  thus  be  incorrect).  This 
is  another  example  of  a  constraint  that  cannot  be  expressed 
without  the  use  of  event  ancestry.  For  example,  waveform 
algebra [5] cannot be used to describe this constraint because it
cannot  pair  up  the  A  and  B  events  to  be  constrained. 


III.  Simulation 

As mentioned in Section II, the representation for timing
constraints is a restricted version of the full first-order
predicate calculus. In particular, the calculus only supports
universal quantification where all quantifiers precede all
clauses. Existential quantification and the ability to negate
quantifiers were completely removed.

These  simplifications  were  made  for  a  number  of  reasons. 
In  simulation,  the  universe  of  discrete  events  changes  as  new 
events  occur  and  are  added  to  the  universe.  Constraints  are 
statements  about  the  infinite  universe  of  events.  If  the 
universe  is  constantly  changing,  when  can  constraints  be 
checked  and  violations  reported?  Violations  can  be  detected 
and  reported  for  constraints  with  universal  quantification 
because  a  particular  instantiation  of  discrete  events  causes 
constraint  violation.  With  existential  quantification 
violations  often  can  never  be  reported  because  the  events  that 
satisfy the constraint may not yet have occurred. Of course, we
can  report  satisfaction  for  existentially  quantified  constraints 
and  not  for  universal  ones,  but  in  simulation  (i.e.,  for 
validation  purposes)  constraint  violations  are  of  primary 
interest.  Often  we  would  not  be  able  to  conclude  anything 
about  constraints  that  contain  both  quantifications  and  the 
simulator  would  be  extremely  inefficient  because  constraints 
would  be  repeatedly  checked. 

In  addition,  event  generation  (existence)  is  represented  in 
the  functional  components  of  the  graph.  If  an  event  is  required 
to  exist  given  a  particular  set  of  circumstances,  an  operation 
can  be  defined  which  generates  the  event  given  the 
circumstances. Therefore, existential quantification, which
is used primarily to describe functionality, is already
encapsulated within our representation.

Of  course,  there  are  some  relationships  that  can  be  easily 
expressed  in  the  full  calculus  which  do  not  suffer  from  the 
problems described above. We added pco, nco, and mra to our
restricted  calculus  because  they  encapsulate  relations  which  are 
essential  for  timing  constraint  expression.  They  have 
semantics  that  are  easily  implemented  by  the  simulator  and  the 
three  relations  do  not  introduce  complications  which  would 
prevent  efficient  incremental  constraint  checking  —  the 
important  motivation  for  our  restrictions.  Of  course  restricting 
the  calculus  does  weaken  its  expressibility.  However,  based  on 
experience  with  many  example  specifications  we  believe  that 
there  is  no  effect  with  respect  to  the  specification  of  timing 
requirements  encountered  in  digital  systems. 

III.1  Simulation Efficiency


quantify: A0, B0, F0, F1, REQ1
context: pco(A0,F,F0) $\land$ pco(B0,F,F1) $\land$
         mra(A0,REQ,REQ1) $\land$ mra(B0,REQ,REQ1)
requirement: pco(F0,F,F1) $\lor$ pco(F1,F,F0)

Figure  7.  A  sequential  logic  constraint  requiring  two 
events  to  be  one  cycle  apart. 


The  context  for  the  constraint  establishes  the  identities  of 
the quantified events. F0 is the clock edge prior to A0; F1 is
the clock edge prior to B0; A0 and B0 share the same REQ
ancestor. The requirement is simply that F0 be the clock edge
prior to F1 or that F1 be the clock edge prior to F0.


The  main  efficiency  concern  in  the  simulator  is  the 
incremental  checking  of  timing  constraints.  Whenever  a  new 
discrete  event  occurs,  all  constraints  that  quantify  the  event 
must be checked. The new event is quantified in the constraint
and  all  possible  combinations  of  discrete  events  (that 
previously  occurred)  are  tried  for  the  other  quantified  events. 
This  mechanism  ensures  that  each  constraint  will  only  be 
checked  once  for  each  possible  assignment  of  unique  discrete 
events.  For  example,  consider  the  following  constraint: 
“quantify: $X_0$, $Y_1$ context: $\tau(X_0) > \tau(Y_1)$ and some requirement”.
If X events had previously occurred at time 5 and 12 and a new Y
event occurred at time 17, the constraint would be checked at
time 17 with Y@17 instantiated for $Y_1$ and with X@5, and then
X@12, instantiated for $X_0$. If a new X event occurred at time 20
the constraint would be checked with X@20 instantiated for $X_0$
and with Y@17 instantiated for $Y_1$. Constraint checking
requires an exponential amount of time with respect to the
number of events quantified by a constraint. Fortunately, most
timing  constraints  contain  only  two  quantifications.  This  is 
because  most  constraints  require  a  separation  time  between  two 
events,  and  a  simple  context  (i.e.  a  simple  chronological  or 
causal  relationship)  is  sufficient  to  identify  the  discrete  events 
being  constrained. 
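
The instantiation scheme described above can be sketched as follows. This is an illustrative Python sketch and not OEsim's implementation: the function check_new_event, the history dictionary, and the requirement t(X0) - t(Y1) >= 5 are hypothetical names and values introduced only for this example.

```python
from itertools import product

def check_new_event(var_types, context, requirement, history, name, time):
    """Check one constraint when event `name` occurs at `time`.

    var_types lists the event type of each quantified variable, in order.
    Returns the time assignments for which the context holds but the
    requirement fails.
    """
    violations = []
    for pos, typ in enumerate(var_types):
        if typ != name:
            continue
        # Bind the new event to this variable; every other variable ranges
        # over events that occurred previously, so each assignment of
        # distinct events is tried exactly once.
        pools = [[time] if i == pos else history.get(t, [])
                 for i, t in enumerate(var_types)]
        for times in product(*pools):
            if context(*times) and not requirement(*times):
                violations.append(times)
    history.setdefault(name, []).append(time)
    return violations

# quantify: X0, Y1   context: t(X0) > t(Y1)
# requirement (assumed for illustration): t(X0) - t(Y1) >= 5
context = lambda x0, y1: x0 > y1
requirement = lambda x0, y1: x0 - y1 >= 5

history = {}
for name, time in [("X", 5), ("X", 12), ("Y", 17), ("X", 20)]:
    found = check_new_event(["X", "Y"], context, requirement, history, name, time)
    print(name, time, found)
```

Only the last event (X at time 20) produces an assignment whose context is true, which matches the example in the text.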

Many  other  optimizations  are  performed  to  make  the 
simulator  more  efficient.  The  occurrence  of  an  event  need  not 
always  trigger  the  evaluation  of  a  constraint  quantifying  that 
event  as  described  above.  If  a  constraint’s  context  strictly 
orders  the  time  of  occurrence  of  two  events  then  a  new  event 
need  not  be  instantiated  into  the  earlier  event  because  it  is 
known  that  the  context  will  be  false  (the  later  event  has  not  yet 
occurred). For example, since t(X0) > t(Y1), new occurrences of
Y  do  not  require  checking  of  the  constraint.  This  static 
optimization  is  also  applied  to  constraints  which  refer  to 
ancestry  because  an  ancestor  must  always  occur  before  its 
descendant.  Similar  optimizations  are  made  for  constraints 
involving  the  chronological  relations  (pco  and  nco)  and  for 
events  which  are  known  to  occur  at  the  same  time.  In  this  case, 
only one of the events needs to trigger constraint evaluation.

The  optimization  described  above  helps  reduce  the  amount 
of  time  required  by  constraint  checking.  However,  as  the 
simulation  progresses  and  new  events  occur,  constraint 
checking  becomes  more  time  consuming  because  more  events 
are  instantiated  each  time  a  constraint's  context  is  evaluated. 
This  problem  is  related  to  another  efficiency  concern:  the 
amount  of  space  required  to  store  past  events  and  maintain  the 
directed  acyclic  ancestry  graph.  This  is  particularly 
troublesome  in  that  larger  simulations  usually  generate  lots  of 
events,  and  some  events  (e.g.  clock  edges)  occur  very 
frequently.  However,  it  should  be  pointed  out  that  larger 
simulations  (i.e.  large  OEgraphs)  are  not  inherently  less 
efficient  to  simulate.  Timing  constraints  apply  to  small 
numbers  of  events,  irrespective  of  graph  size,  and  the 
efficiency  of  the  simulation  engine  is  related  to  the  amount  of 
parallelism  inherent  in  the  graph,  not  the  graph  size. 

One  approach  to  this  problem  is  discrete  event  removal. 
Discrete  events  can  be  removed  from  storage  if  it  can  be  shown 
that  they  are  no  longer  needed  for  constraint  checking.  In  the 
simple  case,  if  an  event  is  not  a  part  of  any  quantified  set,  it 
need  not  be  stored.  Before  removing  the  discrete  event,  the 
ancestry  information  which  is  contained  in  the  discrete  event 
node  needs  to  be  pushed  outward.  This  is  accomplished  by 
connecting the children of the node to the parents of the node
before  deleting  the  node.  We  intend  to  extend  this  simple 
optimization  (which  has  been  implemented)  to  support  the 
removal  of  discrete  events  even  when  they  are  quantified  and 
appear  in  timing  constraints.  Many  constraints  involve 
simple  chronological  relationships  that  do  not  require  storing 
complete  histories  (e.g.  instead  of  storing  every  clock  edge, 
only  the  two  most  recent  edges  are  stored  since  they  are  the 
only  ones  involved  in  constraints)  and  it  should  be  possible  to 
determine  a  priori  how  many  events  need  to  be  stored.  With 
respect  to  causality,  often  only  the  more  recent  causal  chains 
are  important,  and  in  this  case  an  event  removal  technique  akin 
to  garbage  collecting  could  operate  periodically  and  effectively 
prune  the  ancestry  DAG. 
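
The simple removal step (connecting a node's children to its parents before deleting it) might look like the sketch below. The dictionary-of-sets representation and the name remove_event are assumptions made for illustration; the text does not describe OEsim's actual data structures.

```python
def remove_event(parents, children, node):
    """Delete `node` from the ancestry DAG, pushing its ancestry outward."""
    for p in parents[node]:
        children[p].discard(node)
        children[p] |= children[node]   # parents adopt the node's children
    for c in children[node]:
        parents[c].discard(node)
        parents[c] |= parents[node]     # children inherit the node's parents
    del parents[node], children[node]

# Ancestry chain REQ -> A -> B, where A is assumed not to be quantified
# by any constraint and can therefore be dropped.
parents = {"REQ": set(), "A": {"REQ"}, "B": {"A"}}
children = {"REQ": {"A"}, "A": {"B"}, "B": set()}
remove_event(parents, children, "A")
print(parents)
print(children)
```

After the call, B still records REQ as an ancestor even though the intermediate event A is no longer stored.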

For  a  given  constraint,  if  analysis  can  prove  that  under 
any possible execution of the specification the context is
always  false,  the  constraint  would  not  need  to  be  checked. 
Likewise,  the  constraint  would  not  need  to  be  checked  if  any 
possible  execution  would  result  in  the  requirement  always 
being  true.  Such  a  simulation  optimization  tool  (capable  of 
detecting  these  two  conditions)  would  constitute  a  very 
powerful verification tool.


III.2 Implementation and Interface

OEsim is a compiled simulator that takes as input a C++
description of an OEgraph and its associated constraints and
generates a compiled and linked form that includes the
simulator front-end. This produces a single executable
simulation program. By virtue of being a compiled C++
program, an operation node's program can include arbitrary
C++ code that can be used to provide special interactions with
the user (e.g., read and write data files). The simulation steps
through the execution of the OEgraph and contains commands
which support user control (e.g., single stepping, setting
breakpoints, scheduling events, etc.). Figure 8 contains an
example simulation showing a violation of the setup
constraint described in Figure 6.


    Welcome To Simulation v1.3, Mon Nov 5 15:13:43 1990
    ... no stimulus file (figS.itf) found
    oesim-0> schedule-event F 0
    oesim-0> schedule-wire D HIGH 60
    oesim-0> schedule-wire D LOW 120
    oesim-0> run-to 150
    event_occurs at time: 0 event F
    event_occurs at time: 25 event R
    event_occurs at time: 50 event F
    event_occurs at time: 60 event DS<external> (D=HIGH)
    event_occurs at time: 75 event R
    event_occurs at time: 100 event F
    event_occurs at time: 120 event DS<external> (D=LOW)
    event_occurs at time: 125 event R

    ** Constraint Violation: setup:
    R0 D0 : nco(D0, R, R0) --> ((t(R0) - t(D0)) > 10)
    R0 = unique event: R occurrence: 3 at time: 125
    D0 = unique event: DS<external> occurrence: 2 at time: 120

    stopped at time: 150

Figure  8.  An  example  simulation  showing  a  violation  of 
the  setup  constraint  described  in  Figure  6. 


The simulator was implemented using C++ in a UNIX
environment  and  has  already  been  instrumental  in  debugging 
specifications.  We  have  used  our  simulator  to  describe  a  wide 
range  of  examples  derived  from  real  circuits  or  extracted  from 
the  specification  and  synthesis  literature — the  largest  being  a 
partial  specification  and  simulation  of  the  Intel  Multibus  (see 
[?]). We have yet to analyze the performance of OEsim in
detail,  but  have  found  it  to  be  efficient  and  capable.  Compile 
time  (a  few  minutes)  has  always  exceeded  simulation  time 
except  for  un-optimized  simulations  containing  constraints 
that  quantify  many  events.  Space  efficiency  has  not  been  a 
problem  (i.e.  storing  over  a  million  events)  but  further  work  is 
needed  to  support  larger  and  more  lengthy  simulations. 

At  the  University  of  California  at  Berkeley,  OEsim  is 
being  used  to  represent  the  abstract  interfaces  of  complex 
components that must be interconnected on a printed-circuit
board  or  multi-chip  module.  The  result  is  a  simulation  module 
for  the  glue  logic  that  must  be  designed  and  an  understanding  of 
the  many  timing  relationships  that  must  be  maintained  when 
the  components  are  interconnected. 

V.  Conclusion 

OEgraphs  were  designed  with  simulation  in  mind.  A  clear 
simulation  semantics  was  a  requirement  for  all  the  features  of 
the model. An important goal consisted of modeling general
timing constraints expressed using both chronological and
causal relationships. The difficulty with timing constraints
was ensuring that the calculus used for their specification had a
clear simulation semantics and would support incremental
constraint checking. This was accomplished with the three
new primitives and format restrictions outlined in Section II.

Existing simulators (e.g., THOR [9], COSMOS [10], VHDL
[11]) focus primarily on functional aspects of both behavior
and structure. To consider timing constraints, users have to
develop functional modules to check timing properties and
report their satisfaction/violation. This is precisely the
approach taken in VAL which uses a VHDL simulation engine
and augments the VHDL with assertions in waveform algebra
[12]. It is also important that these types of structured
approaches are used rather than the more common ad hoc
methods. Specification of timing constraints is very error
prone and must be done consistently to be useful.

There are many reasons for developing a new
representation and simulator instead of attaching our constraint
language to a VHDL simulator. First, complete VHDL is very
difficult to synthesize and we would have had to use a subset of
the language. Many constructs in the language allow the
manipulation of the simulation model and event queues. It
is not at all clear how to synthesize descriptions using these
constructs. Second, to be able to refer to causal relationships
among VHDL events would have meant modifying fundamental
data structures to include the ancestry trees. Third, many
aspects of behavior are best represented using an event model
that is not completely supported in VHDL which is based
primarily on a signal wire model. Lastly, we wanted the
capability of including arbitrary C++ code in our simulations
so that user interfaces and other interesting I/O could be
directly incorporated in the executable for the simulation.

Implementation  of  OEsim  is  complete,  although 
additional  work  to  improve  the  user  interface  and  to  extend 
existing  efficiency  optimizations  needs  to  be  done.  Our 
research emphasis has moved from OEsim to other tools which
operate  within  the  same  framework.  OEgraphs  give  time  and 
timing  constraints  their  long  deserved  equivalent  status  with 
structural  and  functional  specifications.  We  believe  the 
representation  provides  a  target  for  the  development  of 
domain-specific  description  languages  and  a  basis  for 
incremental  synthesis  algorithms.  The  focus  of  our  synthesis 
work  is  on  control  logic  synthesis  and  scheduling  algorithms. 
OEsim will be critical in verifying our efforts.

Abstract. In synthesizing a circuit from its description in a
concurrent  programming  language,  it  is  necessary  to  make 
decisions  about  how  to  implement  synchronization  constructs 
such  as  send  and  receive  statements.  The  semantic  model  of 
these  constructs  is  an  infinite  length  FIFO  queue  that  can 
handle  all  send  events  until  they  are  paired  up  with 
corresponding  receive  events.  In  this  paper,  we  describe  an 
algorithm  to  size  these  synchronization  queues  while 
permitting  the  maximum  parallelism  between  the 
communicating  processes  (circuits).  It  is  an  example  of  higher 
level  synthesis  in  that  the  user  does  not  include  an  explicit 
description  of  the  queue  in  the  specification  as  is  necessary  in 
current high level synthesis systems.

1.  Introduction 

High  level  synthesis  is  the  process  of  deriving  hardware 
implementations  for  circuits  from  concurrent  programming 
languages  or  other  high  level  specifications  [1].  These 
specifications  may  include  statements  that  are  not  inherently 
implementable.  An  example  of  this  is  the  send  and  receive 
statements  used  for  inter-process  communication.  These  are 
especially  important  in  the  specification  of  digital  signal 
processing,  communications,  and  protocol  circuits.  The 
semantics  of  send  and  receive  is  that  two  processes  are 
connected  by  an  infinite  length  FIFO  queue  that  can  handle  all 
send  events  so  that  they  can  be  paired  up  with  the 
corresponding  receive  event.  In  current  high  level  synthesis 
systems,  the  user  must  explicitly  bound  the  size  of  the  queue 
and  ensure  that  the  two  parallel  processes  (circuits)  will  never 
need  to  exceed  that  bound.  This  is  normally  achieved  by  the 
user  placing  additional  control  statements  in  the  specification 
(e.g., in the form of handshaking signals) [2].

This leads to a design style that is not modular and may also
unduly  limit  the  parallelism  inherent  in  the  specification.  A 
higher  level  synthesis  system  should  be  able  to  use  timing 
constraints  on  the  rate  of  send  and  receive  events  to  compute  a 
bound  for  the  queue  size.  As  different  interconnections  are 
made  between  components  it  should  not  be  necessary  to  change 
their internal specification (as is currently the case). If no
bound  is  computable  then  the  system  could  automatically 
modify  the  specification  to  include  control  logic  to  ensure  that 
a  certain  size  is  not  exceeded.  The  tradeoff  between  the  size  of 
the  queue,  the  complexity  of  the  control  logic,  and  the 
parallelism  in  the  circuit  could  be  under  the  direction  of  the 
synthesis  system  and  not  the  user.  By  sizing  queues 
automatically, different tradeoffs can be made for different
circuit configurations without changing the original
specifications of the circuit components (processes).

Our  model  of  a  send/receive  queue  is  shown  in  Figure  1. 
From  a  user's  perspective,  the  queue  is  of  infinite  length  —  the 
user does not specify a size for the queue; it is assumed that the
queue  will  be  as  large  as  needed  to  process  every  send  event  and 
store  its  data  until  a  receive  event  indicates  that  the  data  should 
be  output.  From  a  synthesis  perspective,  there  are  two 
possibilities  depending  on  whether  the  queue  size  can  be 
bounded  or  not.  An  unboundable  queue  will  necessitate  explicit 
handshaking  signals  that  can  be  used  to  block  the  producer 
and/or  consumer  processes  (thereby  restricting  some  possible 
parallelism). A bounded queue may be implemented without explicit
handshaking and, if it is of a small enough size (i.e., one or two
entries)  may  be  removed  altogether  through  a  merging  of  the 
two  processes.  In  this  paper,  we  explore  the  problem  of 
determining  the  size  of  a  synchronization  queue  given 
constraints  on  the  relative  frequency  of  its  send  and  receive 
events. 


[Figure 1 diagram: a queue block with ports Send, DataIn,
Receive, ValidData, and DataOut.]

Figure 1. Our model of a generic FIFO queue. Send events
indicate that data should be captured by the queue. Receive
events indicate that the environment is ready to receive
data. The queue issues ValidData events to indicate that it
has taken data out of its FIFO and placed it on the output.

This paper is divided into four major sections. Section II
describes the semantics of two particular queue models and their
timing constraints. Section III describes a general technique
for sizing synchronization queues. Section IV provides an
example and Section V concludes the paper summarizing the
contributions.

II.  Queue  Models  and  Timing  Constraints 

Many types of queues can be derived from the generic queue
model presented in Figure 1. A non-blocking queue interprets
receive events as requests for output which can be ignored if the
queue is empty. The receive events do not block and arrive
periodically to poll the queue for data. A blocking queue
interprets receive events as requests for output which must be
satisfied: when the queue is empty a receive event blocks (no
additional receive events will occur) until a send event arrives
and the receive is acknowledged by ValidData. These two
queues appear often in the context of digital signal processing,
communications, and memory and system bus interfaces.

28th ACM/IEEE Design Automation Conference
© 1991 ACM 0-89791-395-7/91/0006/0690
Paper 39.5

Asynchronous  input  data  can  be  synchronized  by  a  non- 
blocking  queue  (e.g.,  every  falling  clock  edge  corresponds  to  a 
receive  event  which  indicates  that  data  can  be  output  and  thus 
synchronized  to  the  clock).  In  a  blocking  queue  send  and 
receive  events  are  paired  up;  this  corresponds  to  the 
send/receive constructs found in many concurrent programming
languages. 

The  two  queues  and  their  timing  constraints  are  formally 
defined  using  OEgraphs  which  provide  a  formal  semantics  and  a 
framework  for  our  analysis  [3,4].  In  this  paper,  we  present  a 
simplified  syntax  for  timing  constraints  and  rely  on  the 
informal  descriptions  of  the  queues  presented  above. 

Timing constraints express information about the relative
frequency of send and receive events. How quickly events can
occur is specified using the constraint "$E_n \ge f$" which
informally states that the $n$th next occurrence of E ($E_n$) relative
to any E event ($E_0$) takes place at least $f$ time units after $E_0$.
Thus, "$S_1 \ge 5$" states that send events (S = send, R = receive) are
separated by at least 5. The constraint "$E_m \le g$" specifies the
slowest rate that events can occur. Thus, "$R_3 \le 10$" states that
the third receive event after any receive must take place within
10 time units.

The  constraint  definitions  given  above  rely  on  being  able 
to fully order the occurrence of events (e.g., to be able to talk
about the 3rd receive event after another receive event). In
OEgraphs the time at which an event occurs is not an infinite
precision real, but instead is an integer [1]. Thus two events
may appear to occur at the same time due to a natural granularity
problem that also exists in the physical world (i.e., at some
level of detail it is not possible to determine which event
actually occurred first). Timing constraints express properties
that are true given any interpretation for the order of events
that occur at the same time (i.e., within the same time grain).
The two constraints are more formally defined as follows:

$E_n \ge f$ $\equiv$ given $n$ and $f$ positive integers and two points in
time $t_1$ and $t_2$, if there are $n+1$ or more Es in the closed
interval $[t_1,t_2]$ then $t_2 - t_1 \ge f$.

$E_m \le g$ $\equiv$ given $m$ and $g$ positive integers and two points
in time $t_1$ and $t_2$, if there are $m+1$ or more Es in the closed
interval $[t_1,t_2]$ and fewer than $m$ Es in the open interval
$(t_1,t_2)$ then $t_2 - t_1 \le g$.
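
As a sanity check, the two definitions can be transcribed almost literally into a brute-force test over a recorded trace of event times. This is only a sketch under stated assumptions: the function names are hypothetical, events are integer times, and the interval endpoints are restricted to the window spanned by the trace (the definitions themselves quantify over all points in time).

```python
def satisfies_min_sep(events, n, f):
    """E_n >= f: a closed interval with n+1 or more events has size >= f."""
    lo, hi = min(events), max(events)
    for t1 in range(lo, hi + 1):
        for t2 in range(t1, hi + 1):
            closed = sum(t1 <= e <= t2 for e in events)
            if closed >= n + 1 and t2 - t1 < f:
                return False
    return True

def satisfies_max_sep(events, m, g):
    """E_m <= g: m+1 or more events in [t1,t2] and fewer than m in (t1,t2)
    implies t2 - t1 <= g."""
    lo, hi = min(events), max(events)
    for t1 in range(lo, hi + 1):
        for t2 in range(t1, hi + 1):
            closed = sum(t1 <= e <= t2 for e in events)
            opened = sum(t1 < e < t2 for e in events)
            if closed >= m + 1 and opened < m and t2 - t1 > g:
                return False
    return True

print(satisfies_min_sep([0, 5, 10, 15], 1, 5))     # sends at least 5 apart
print(satisfies_max_sep([0, 3, 6, 9, 12], 3, 10))  # third next receive within 10
```

The quadratic scan is deliberately naive; it mirrors the wording of the definitions rather than aiming for efficiency.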

Our  analysis  would  be  conceptually  easier  if  we  assumed  that 
every  receive  or  send  event  was  separated  by  at  least  1  time 
unit. However, we consider the more general case in which two
events can appear to take place at the same time. This is
important because our underlying semantics is based upon
OEgraphs. Also, if more complicated queues (e.g., multiple
input  ports)  are  modeled  as  simple  queues  then  the  ability  to 
handle  inputs  which  appear  to  arrive  at  the  same  time  is  crucial. 

More than one constraint may be available for a given
event (e.g., $S_1 \ge 2$ and $S_5 \le 30$). In some cases, one of the
constraints may be redundant because one or more other
constraints can be used to derive a more restrictive constraint
(e.g., if $S_1 \ge 2$ then $S_3 \ge 4$ is redundant because $S_3 \ge 6$
can be inferred from $S_1 \ge 2$). Redundant constraints can be removed
but our results are not altered in the presence of redundant
constraints and it is more efficient to process them than it is to

detect  and  remove  them.  In  some  cases,  the  constraints  will  be 
ill-formed  and  the  specification  may  be  considered  nonsensical 
(e.g..  S3  2  3  and  S]  <  4). 

Our  analysis  is  also  based  on  an  important  assumption. 
"$E_m \le g$" is formally defined as a timing constraint that is
satisfied when E events occur: it does not state that the events
must occur. Since we are interested in operating circuits, we
assume that the events will necessarily occur, and are interested
in the steady state behavior (i.e., we assume that receive events
will  not  start  occurring  arbitrarily  later  than  the  send  events). 

It  is  important  to  note  here  that  our  usage  of  queues  is  quite 
different  than  in  queueing  theory.  The  constraints  on  events 
are  not  statistical  averages  but  are  abstractions  of  the 
propagation  delays  inherent  in  the  hardware  used  to  construct 
these  circuits.  As  such,  they  are  deterministic  and  enable  us  to 
set  firm  bounds  on  the  size  of  queues  and  even  consider  their 
elimination  by  transforming  the  queue  into  other  memory 
elements  such  as  registers. 

III.  Queue  Sizing 

Queue  sizing  is  accomplished  by  comparing  the  fastest 
rate  at  which  send  events  can  arrive  with  the  slowest  rate  at 
which  receive  events  must  arrive.  Before  we  present  the 
algorithms  for  determining  the  queue  size  some  preliminaries 
are  in  order  including  ascertaining  that  the  constraints  supplied 
are  well-formed. 

Intervals are specified using two integer time values and
may be open $(t_1,t_2)$ or closed $[t_1,t_2]$. Open intervals may be
converted to closed intervals since times are integer values
(e.g., $(t_1,t_2) = [t_1+1,\ t_2-1]$). We denote the maximum number
of E events that can occur in the closed interval $[t_1,t_2]$ as
mostEin$[t_1,t_2]$ and similarly denote the least number as
leastEin$[t_1,t_2]$. Note that mostEin and leastEin are based on
the size of the interval (i.e., $t_2 - t_1$), not the actual values of
$t_1$ and $t_2$. Regardless of any constraints on E events, if
$t_1 > t_2$ then mostEin$[t_1,t_2] = 0$ and leastEin$[t_1,t_2] = 0$. In
the absence of any constraints on E events and if $t_1 \le t_2$ then
mostEin$[t_1,t_2] = \infty$ and leastEin$[t_1,t_2] = 0$. The constraints
on an event are ill-formed if there exists an interval in which
the minimum number of events constrained to exist by a constraint
is greater than the maximum number of events allowed by another
constraint.

Theorem 1: Given any number of constraints, if $g/m < f/n$
(for some $E_n \ge f$ and $E_m \le g$ constraint) then the constraints
are ill-formed.

Proof: Consider an interval of size $fg - 1$. It can have no
more than $ng$ Es (from $E_{ng} \ge fg$, which is derived by multiplying
the constraint $E_n \ge f$ by $g$) and must have at least $mf$ Es (from
$E_{mf} \le gf$, which is derived by multiplying the constraint
$E_m \le g$ by $f$). But $g/m < f/n$ gives $ng < mf$, a contradiction
which indicates that the constraints are ill-formed.
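
The test in Theorem 1 is a pairwise comparison and can be sketched in a few lines. The function name ill_formed and the representation of constraints as (count, time) pairs are assumptions made for this example. Note that the theorem gives a sufficient condition only, so a False result does not prove the constraints are well-formed.

```python
from fractions import Fraction

def ill_formed(min_seps, max_seps):
    """Theorem 1 test.

    min_seps: list of (n, f) pairs, one per E_n >= f constraint.
    max_seps: list of (m, g) pairs, one per E_m <= g constraint.
    Returns True if some pair satisfies g/m < f/n.
    """
    return any(Fraction(g, m) < Fraction(f, n)
               for n, f in min_seps
               for m, g in max_seps)

print(ill_formed([(1, 4)], [(3, 3)]))   # events >= 4 apart, yet 3 more within 3
print(ill_formed([(1, 2)], [(3, 10)]))  # compatible pair
```

Exact rational arithmetic (fractions.Fraction) is used so that the comparison is not affected by floating-point rounding.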

Given  a  set  of  well-formed  constraints  for  both  the  send  and 
receive  events  of  a  queue  it  is  then  possible  to  determine  if  the 
queue  size  is  bounded  or  not. 

Theorem 2: Given any number of constraints on S and R
that are well-formed:

    if $\max_{\text{all } R_m \le g}(m/g) \ge \min_{\text{all } S_n \ge f}(n/f)$
    then the queue is boundable.

    if $\max_{\text{all } R_m \le g}(m/g) < \min_{\text{all } S_n \ge f}(n/f)$
    then the queue is not boundable.

Proof: The justification for this theorem is rather intuitive.
"$\max_{\text{all } R_m \le g}(m/g)$" represents the slowest rate at which
receive events must arrive and "$\min_{\text{all } S_n \ge f}(n/f)$"
represents the fastest rate at which send events could arrive. A
necessary (but not sufficient) condition for bounding the queue is
that on average the number of R events that occur is greater than or
equal to the number of S events.
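
The rate comparison of Theorem 2 can be sketched as below. The function name boundable and the (count, time) pair representation are hypothetical, and the sketch assumes at least one constraint of each kind is supplied.

```python
from fractions import Fraction

def boundable(send_min_seps, recv_max_seps):
    """Theorem 2 test.

    send_min_seps: list of (n, f) pairs, one per S_n >= f constraint.
    recv_max_seps: list of (m, g) pairs, one per R_m <= g constraint.
    """
    # Slowest rate at which receive events must arrive.
    slowest_receive = max(Fraction(m, g) for m, g in recv_max_seps)
    # Fastest rate at which send events could arrive.
    fastest_send = min(Fraction(n, f) for n, f in send_min_seps)
    return slowest_receive >= fastest_send

print(boundable([(2, 8)], [(1, 3)]))  # sends at most 1/4, receives at least 1/3
print(boundable([(2, 4)], [(1, 3)]))  # sends at most 1/2, receives at least 1/3
```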

If the queue is bounded then for a given time interval, we
contrast the maximum number of send events that may have
occurred with the minimum number of receive events that must
have occurred during the interval. The size of the queue is the
maximum difference over all possible interval sizes. The
algorithm for determining maximum queue sizes is given
below:

    I = maxQ = 0;
    repeat
        Imax = mostSin[0,I] - leastRin(0,I)
        maxQ = Max(maxQ, Imax)
        I = I + 1
    until (Imax <= 0)

The algorithm terminates when a negative or zero sized queue is
achieved for a given interval. We don't need to consider larger
intervals because somewhere in the larger interval the queue
will be empty and the larger interval consists of smaller
sub-intervals which have already been analyzed. The only problem
with the algorithm as presented is that there is no guarantee
that it terminates (in fact, it would not terminate if it were used
on unboundable queues). When the rates of S and R events are
equal, the algorithm may not terminate because the queue may
never empty but still contain a boundable number of entries.
For this special case, the loop termination condition changes
to:

    until (I >= least common multiple of the
           g in Max{all R_m <= g}(m/g) and the f in Min{all S_n >= f}(n/f))

The algorithm terminates after running for a fixed number of
intervals. Larger intervals are not considered because the
behavior of the queue follows a worst-case pattern which is
cyclical.

The algorithms above rely on two functions mostin and
leastin that return the number of events (maximum and
minimum number, respectively) that can occur in an interval.
Note that the minimum number in the open interval is
subtracted from the maximum number in the closed interval.
This is necessary to properly handle events that may occur in
the same time grain. A numerical algorithm is required because
the maximum and minimum over all intervals cannot be
determined analytically due to the discontinuous nature of the
two functions.

The values returned by mostin and leastin are dependent on
the constraints on the events and the size of the interval. We
now consider the definition of these two functions and begin
with two simple theorems that apply when only one constraint
on an event has been specified.

Theorem 3: Given $E_n \ge f$ then mostEin$[t,t+I] = n(\lfloor I/f \rfloor + 1)$.
Proof: If $I < f$ there can be at most $n$ Es in $[t,t+I]$ because if
there were $n+1$ or more Es, this would directly contradict the
definition of $E_n \ge f$. If $I \ge f$ then $[t,t+I]$ consists of
$\lfloor I/f \rfloor + 1$ subintervals of size less than $f$ (i.e.,
$[t,t+f-1]$, $[t+f,t+2f-1]$, ..., $[t+f\lfloor I/f \rfloor,\ t+I]$) each
of which can have at most $n$ Es.

Theorem 4: Given $E_m \le g$ then leastEin$[t,t+I] = m\lfloor (I+1)/g \rfloor$.
Proof: Every interval of size $g-1$ (i.e., $[t,t+g-1]$) must have
at least $m$ Es, otherwise the first E (or one of the first if tied)
that occurred before the interval has as its $m$th next E an E after
the interval; this would directly contradict the definition of
$E_m \le g$. The closed interval $[t,t+I]$ consists of $\lfloor I/g \rfloor$
subintervals of size $g-1$ (i.e., $[t,t+g-1]$, $[t+g,t+2g-1]$, ...)
each of which has at least $m$ Es, and one leftover interval
$[t+g\lfloor I/g \rfloor,\ t+I]$ of size $I - g\lfloor I/g \rfloor$ which has
at least $m$ Es if $I - g\lfloor I/g \rfloor = g-1$. The summation of
$m\lfloor I/g \rfloor$ plus $m$ if $I - g\lfloor I/g \rfloor = g-1$ easily
simplifies to $m\lfloor (I+1)/g \rfloor$.
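
For the single-constraint case, the closed forms of Theorems 3 and 4 can be plugged directly into the queue-sizing loop. The sketch below assumes exactly one $S_n \ge f$ constraint and one $R_m \le g$ constraint; the function names and the iteration cap `limit` are hypothetical additions (the cap stands in for the separate termination rule needed when the loop would otherwise not stop).

```python
def most_in(n, f, size):
    """Theorem 3: most events in a closed interval of this size under E_n >= f."""
    return 0 if size < 0 else n * (size // f + 1)

def least_in(m, g, size):
    """Theorem 4: least events in a closed interval of this size under E_m <= g."""
    return 0 if size < 0 else m * ((size + 1) // g)

def max_queue_size(n, f, m, g, limit=10_000):
    """Queue-sizing loop for one S_n >= f and one R_m <= g constraint."""
    max_q = 0
    for I in range(limit):
        # Sends are counted in the closed interval [0, I]; receives in the
        # open interval (0, I), i.e. the closed interval [1, I-1] of size I-2.
        backlog = most_in(n, f, I) - least_in(m, g, I - 2)
        max_q = max(max_q, backlog)
        if backlog <= 0:
            return max_q
    raise RuntimeError("no empty-queue interval found within the limit")

print(max_queue_size(1, 5, 1, 3))  # S_1 >= 5, R_1 <= 3
print(max_queue_size(2, 8, 1, 3))  # S_2 >= 8, R_1 <= 3
```

With several constraints per event, most_in and least_in would be replaced by the multi-constraint functions discussed next.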


We  next  consider  cases  in  which  there  are  multiple 
constraints  specified  for  an  event  (including  both  minimums 
and  maximums).  The  additional  constraints  provide  further 
information  which  may  tighten  the  bounds  on  the  minimum 
and  maximum  number  of  events  that  can  occur  in  a  given 
interval.  For  example,  a  constraint  stating  a  minimum 
separation  can  actually  increase  the  number  of  events  that  must 
occur in a given interval (e.g., given just $E_3 \le 10$ an interval of
size 5 is not required to have any Es, but if we add $E_1 \ge 2$ then
the Es satisfying $E_3 \le 10$ cannot occur together, and an interval
of size 5 is now required to have at least one E).

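Theorem 5: Given any number of $E_n \ge f$ constraints and any
number of $E_m \le g$ constraints then:

$$\mathit{mostEin}[t,t+I] = \min_{\text{all } E_n \ge f} \left\{ n\lfloor I/f \rfloor + \min\left( n - \mathit{leastEin}[(I \bmod f)+1,\ f-1],\ \begin{cases} n, & \text{if } I < f \\ \mathit{mostEin}[0,\ I \bmod f], & \text{otherwise} \end{cases} \right) \right\}$$

$$\mathit{leastEin}[t,t+I] = \max_{\text{all } E_m \le g} \left\{ m\lfloor I/g \rfloor + \max\left( m - \mathit{mostEin}[(I \bmod g)+1,\ g-1],\ \begin{cases} 0, & \text{if } I < g \\ \mathit{leastEin}[0,\ I \bmod g], & \text{otherwise} \end{cases} \right) \right\}$$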

Proof: (Omitted due to space constraints.) Intuitively, the
two expressions are cross-coupled recurrence relations because
a maximum (minimum) constraint can affect the spacing of
events in an interval constrained by a minimum (maximum)
constraint. We take the minimum (maximum) over all
constraints so as to obtain the most constrained result. The
basic expression sums the events in all sub-intervals of size f
(g) and then takes the minimum (maximum) of n (0) and the
number of events constrained to occur in the interval left over
of size less than f (g).

We  now  present  an  algorithm  (see  below)  to  solve  the 
cross-coupled  recurrence  relations  and  yield  the  maximum  and 
minimum  number  of  events  in  an  interval  of  a  given  size,  given 
any  number  of  well-formed  constraints  on  the  events.  It  is 
based  on  two  arrays  (most  and  least)  of  size  equal  to  the 
intervals  in  question.  The  arrays  are  initialized  up  to  the  size  of 
the largest of all the f's and g's (T). Then each constraint is
applied  to  the  arrays  up  to  the  size  of  the  interval. 

The initialization dominates the complexity of the
algorithm, which otherwise consists of a single application of
the constraints to the interval in question. During
initialization, constraints are repeatedly applied until mostEin
has been minimized and leastEin has been maximized. The
complexity of the initialization is easily (albeit
pessimistically) analyzed. After one iteration, most has an
upper bound of $\max_{\text{all } E_n \ge f}(n\lfloor T/f \rfloor + n)$
and least has a lower bound of 0 (from the initializations to 0
and infinity and one application of the constraints). Regardless
of the number of iterations, most has a lower bound of 0 and
least has an upper bound of
$\max_{\text{all } E_m \le g}(m\lfloor T/g \rfloor + m)$. Every
iteration must either reduce most or increase least for some
interval $\le T$. This gives a very pessimistic upper bound on the
complexity of the algorithm. The algorithm is linear with
respect to the number of constraints and has constants that are
polynomial with respect to the actual numbers specified in the
constraints (i.e., cubic with respect to T). In practice, the
algorithm has been observed to be nearly linear (with respect to
T) because few iterations are required (i.e., in one iteration
most/least for many intervals decreases/increases by large
amounts until the minimum/maximum is reached).
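
One way to arrive at the cubic figure is sketched below. The symbols C (the number of constraints) and B (the larger of the two value bounds quoted above) are introduced here only for illustration, and the final step assumes that B grows no faster than T, which holds when the event counts in the constraints are no larger than the corresponding times.

```latex
% Each array has T+1 integer entries; after the first iteration every entry
% moves monotonically inside a range of width at most B.
\[
\#\text{iterations} \;\le\; 2(T+1)\,B + 2,
\qquad
\text{cost per iteration} \;=\; (T+1)\,C \ \text{constraint applications},
\]
\[
\text{total work} \;=\; O\!\left(C\,T^{2}B\right) \;=\; O\!\left(C\,T^{3}\right)
\quad\text{when } B = O(T).
\]
```

The iteration count follows because every iteration other than the first and the last must change at least one of the 2(T+1) entries by at least one event.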

Compute_Most_Least(intervalSize) {

    apply_constraints(I) {
        for every En ≥ f constraint
            N = n⌊I/f⌋ + Min(most[I mod f], n - least[f-2-(I mod f)])
            most[I] = Min(N, most[I]);
        for every Em ≤ g constraint
            N = m⌊I/g⌋ + Max(least[I mod g], m - most[g-2-(I mod g)])
            least[I] = Max(N, least[I]); }

    if (we haven't initialized)
        check for well-formed constraints (Theorem 1);
        T = Max(Max{all En ≥ f} f, Max{all Em ≤ g} g);
        for I = 0 to T
            most[I] = infinity; least[I] = 0;
        repeat
            for I = 0 to T
                apply_constraints(I)
        until (contents of arrays stop changing)

    apply_constraints(intervalSize)

    return most[intervalSize] and least[intervalSize] }
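
The pseudocode above can be sketched in Python roughly as follows. This is an illustrative reading, not the authors' implementation. The function name, the (n, f) and (m, g) tuple encoding of the constraints, and the treatment of a negative array index (an empty interval) as the trivial bounds of infinity and 0 are all assumptions made here. The well-formedness check of Theorem 1 is omitted, so the inputs are assumed to be well formed. The sample constraints are the values used in the Section IV example.

```python
import math

def compute_most_least(min_sep, max_sep, interval_size):
    """min_sep: list of (n, f) for constraints E_n >= f.
    max_sep: list of (m, g) for constraints E_m <= g.
    Returns (most, least) event counts for an interval of interval_size."""
    # T is the largest time value appearing in any constraint.
    T = max([f for _, f in min_sep] + [g for _, g in max_sep])
    size = max(T, interval_size) + 1
    most = [math.inf] * size
    least = [0] * size

    # Assumption: a negative index (empty interval) yields the trivial bound.
    def most_at(i):
        return most[i] if i >= 0 else math.inf

    def least_at(i):
        return least[i] if i >= 0 else 0

    def apply_constraints(i):
        changed = False
        for n, f in min_sep:
            r = i % f
            bound = n * (i // f) + min(most_at(r), n - least_at(f - 2 - r))
            if bound < most[i]:
                most[i], changed = bound, True
        for m, g in max_sep:
            r = i % g
            bound = m * (i // g) + max(least_at(r), m - most_at(g - 2 - r))
            if bound > least[i]:
                least[i], changed = bound, True
        return changed

    # Initialization: iterate over intervals 0..T until nothing changes.
    while True:
        changed = False
        for i in range(T + 1):
            changed = apply_constraints(i) or changed
        if not changed:
            break

    # Single application for the interval of interest.
    apply_constraints(interval_size)
    return most[interval_size], least[interval_size]

# Send events with S_1 >= 3 and S_100 >= 750, interval of size 297.
print(compute_most_least([(1, 3), (100, 750)], [], 297))  # (100, 0)
# Receive events with R_1 <= 5, interval of size 297.
print(compute_most_least([], [(1, 5)], 297))              # (inf, 59)
```

For simplicity the sketch re-runs the initialization on every call; the "if (we haven't initialized)" guard in the pseudocode would cache the arrays between calls.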

The  techniques  developed  above  can  also  be  used  to  size 
blocking  queues.  The  only  difference  for  blocking  queues  is 
that  constraints  on  receive  events  have  an  altered  semantics.  If 
a receive event blocks, no more receive events will occur until
a  send  event  occurs.  Timing  constraints  need  to  express 
information  about  the  amount  of  time  that  can  pass  from  an 
unblocked  receive  to  the  next  receive  event.  Also  of 
importance  is  the  amount  of  time  that  can  pass  from  a  send  that 
dispatches  the  blocked  receive  event  to  the  next  receive.  These 
constraints  can  be  specified  within  the  OEgraph  framework.  In 
either type of queue, the constraints refer only to the very next
receive  event,  and  it  is  likely  that  the  separation  times  are 
identical  in  both  cases.  Therefore,  we  use  exactly  the  analysis 
presented  above  with  a  different  semantics  for  the  timing 
constraints  on  R  events. 

IV.  Example  Queue  Sizing 

We consider an example in which send and receive events
arrive asynchronously to a non-blocking queue which is used to
buffer data being transferred between processes. Send events
represent either message headers or individual pieces of
message data. The fastest separation time between a header or a
piece of data and the next consecutive piece of message data is
specified as $S_1 \ge 3$. The slowest (worst-case) separation time
between consecutive receive events is also given, $R_1 \le 5$.
Unfortunately, these worst-case constraints are not sufficient to
bound the size of the queue because send events can arrive on
average more often than receive events.

However,  there  is  a  large  delay  between  consecutive 
messages which can be specified as $S_{100} \ge 750$ (e.g., messages
consist  of  at  most  98  pieces  of  data  and  one  header).  This  is 
sufficient  to  compute  a  worst-case  queue  size  of  41  (for  an 
interval  of  size  297)  after  considering  intervals  up  to  501  in 
size.  The  algorithm  does  not  consider  larger  intervals  because 
at  501  the  number  of  receives  is  equal  to  the  number  of  sends 
(100)  and  the  queue  is  guaranteed  to  have  been  emptied. 
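
The figure of 41 can be checked with a short script. The closed forms below are an assumption: they are what the recurrences appear to reduce to for these particular constraints (valid for interval sizes below 750), and the worst-case queue size is taken to be the largest gap between the most sends and the least receives over any interval size. The function names are illustrative.

```python
def most_sends(size):
    # S_1 >= 3 permits floor(size/3) + 1 sends in an interval;
    # S_100 >= 750 caps the count at 100 for sizes below 750.
    return min(size // 3 + 1, 100)

def least_receives(size):
    # R_1 <= 5 guarantees at least floor(size/5) receives.
    return size // 5

def worst_case_queue(max_size):
    # Largest backlog over all interval sizes 0..max_size.
    return max(most_sends(i) - least_receives(i) for i in range(max_size + 1))

print(worst_case_queue(501))                       # 41
print(most_sends(297) - least_receives(297))       # 41
print(most_sends(501), least_receives(501))        # 100 100
```

Under these closed forms the bound of 41 is reached at size 297, and by size 501 the guaranteed receives have caught up with the 100-send cap.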

If it were known that receive events occur more
frequently than the worst-case separation suggests (e.g., interrupts in the
receiving process can delay receive events but the interrupts
themselves do not occur very often), then the queue size can be
decreased. For example, $R_{50} \le 200$ results in a worst-case
queue  size  of  31  (for  an  interval  of  size  294)  after  considering 
intervals  up  to  401  in  size. 
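
The arithmetic behind the figure of 31 can be reconstructed as follows. This is a sketch that assumes the remainder of the interval beyond one full sub-interval of size 200 is governed by $R_1 \le 5$, and that the queue bound is the difference between the most sends and the least receives.

```latex
\[
\begin{aligned}
\text{most sends in an interval of size 294} &= \lfloor 294/3 \rfloor + 1 = 99,\\
\text{least receives in an interval of size 294} &= 50\lfloor 294/200 \rfloor + \lfloor 94/5 \rfloor = 50 + 18 = 68,\\
\text{worst-case queue size} &= 99 - 68 = 31.
\end{aligned}
\]
```

By size 400 the receive bound reaches $50 \cdot 2 = 100$, matching the cap of 100 sends, which is consistent with the search stopping at 401.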

V.  Conclusion 

Sizing  potentially  infinite  queues  is  an  important 
component  of  higher  level  synthesis  [5].  Placing  the  burden 
on  the  user  to  set  the  size  of  such  queues  is  an  unnecessary 
overhead  that  can  lead  to  inefficiencies  in  the  exploitation  of 
the  inherent  parallelism  in  the  specification  and  detracts  from 
the  modularity  and  reusability  of  the  specification. 

In  this  paper,  we  have  described  an  algorithm  for  sizing 
synchronization  queues  given  constraints  on  the  rate  of  send 
and receive events. These constraints are either given or can
often  be  inferred  given  the  presence  of  other  constraints  on  the 
circuit’s  behavior  (e.g.,  the  circuit’s  propagation  delays).  In 
current  synthesis  systems,  the  user  must  explicitly  specify  the 
queue  size  and  then  ensure  that  it  does  not  overflow.  The 
algorithm presented here is a part of an incremental approach
to  synthesis  being  developed  with  OEgraphs  as  a  foundation. 
Our  immediate  plans  are  to  extend  this  work  in  the  direction  of 
synthesizing  the  control  logic  for  the  physical  queue  that  will 
need  to  be  implemented  and  perform  tradeoffs  between  circuit 
parallelism  (slowing  down  the  rate  of  send  events  or  speeding 
up  the  rate  of  receive  events),  queue  size,  and  control  logic 
complexity. 
Operation on Event Graphs